Pandas Google BigQuery

Overview

pandas-gbq


pandas-gbq is a package providing an interface to the Google BigQuery API from pandas.

Installation

Install latest release version via conda

$ conda install pandas-gbq --channel conda-forge

Install latest release version via pip

$ pip install pandas-gbq

Install latest development version

$ pip install git+https://github.com/pydata/pandas-gbq.git

Usage

See the pandas-gbq documentation for more details.
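For example, a minimal round trip looks roughly like this (the project ID and destination table below are placeholders):

import pandas_gbq

# Run a query and load the result into a DataFrame.
df = pandas_gbq.read_gbq(
    "SELECT name, SUM(number) AS total "
    "FROM `bigquery-public-data.usa_names.usa_1910_2013` GROUP BY name",
    project_id="my-project-id",
)

# Write a DataFrame to a BigQuery table, replacing it if it already exists.
pandas_gbq.to_gbq(df, "my_dataset.my_table", project_id="my-project-id", if_exists="replace")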

Comments
  • ENH: Convert read_gbq() function to use google-cloud-python

    ENH: Convert read_gbq() function to use google-cloud-python

    Description

    I've rewritten the current read_gbq() function using google-cloud-python, which handles the naming of structs and arrays out of the box. For more discussion about this, see: https://github.com/pydata/pandas-gbq/issues/23.

    ~However, because google-cloud-python potentially uses different authentication flows and may break existing behavior, I've left the existing read_gbq() function and named this new function from_gbq(). If in the future we are able to reconcile the authentication flows and/or decide to deprecate flows that are not supported in google-cloud-python, we can rename this to read_gbq().~

    UPDATE: As requested in a comment by @jreback (https://github.com/pydata/pandas-gbq/pull/25/files/a763cf071813c836b7e00ae40ccf14e93e8fd72b#r110518161), I deleted the old read_gbq() and named my new function read_gbq(), removing all legacy functions and code.

    Added a few lines to the requirements file, but I'll leave it to you, @jreback, to deal with the conda dependency issues you mentioned in issue 23.

    Let me know if you have any questions or if any tests need to be written. You can confirm that it works by running the following:

    q = """
    select ROW_NUMBER() over () row_num, struct(a,b) col, c, d, c*d c_times_d, e
    from
    (select * from
        (SELECT 1 a, 2 b, null c, 0 d, 100 e)
        UNION ALL
        (SELECT 5 a, 6 b, 0 c, null d, 200 e)
        UNION ALL
        (SELECT 8 a, 9 b, 10.0 c, 10 d, 300 e)
    )
    """
    df = gbq.read_gbq(q, dialect='standard')
    df
    

    | row_num | col                | c    | d    | c_times_d | e   |
    |---------|--------------------|------|------|-----------|-----|
    | 2       | {u'a': 5, u'b': 6} | 0.0  | NaN  | NaN       | 200 |
    | 1       | {u'a': 1, u'b': 2} | NaN  | 0.0  | NaN       | 100 |
    | 3       | {u'a': 8, u'b': 9} | 10.0 | 10.0 | 100.0     | 300 |

    q = """
    select array_agg(a) mylist
    from
    (select "1" a UNION ALL select "2" a)
    """
    df = gbq.read_gbq(q, dialect='standard')
    df
    

    | mylist |
    |--------|
    | [1, 2] |

    q = """
    select array_agg(struct(a,b)) col, f
    from
    (select * from
        (SELECT 1 a, 2 b, null c, 0 d, 100 e, "hello" f)
        UNION ALL
        (SELECT 5 a, 6 b, 0 c, null d, 200 e, "ok" f)
        UNION ALL
        (SELECT 8 a, 9 b, 10.0 c, 10 d, 300 e, "ok" f)
    )
    group by f
    """
    df = gbq.read_gbq(q, dialect='standard')
    df
    

    | col                                      | f     |
    |------------------------------------------|-------|
    | [{u'a': 5, u'b': 6}, {u'a': 8, u'b': 9}] | ok    |
    | [{u'a': 1, u'b': 2}]                     | hello |

    Confirmed that col_order and index_col still work ~(feel free to pull that out into a separate function, since there's now redundant code with read_gbq())~. I also removed the type-conversion lines, which appear to be unnecessary: google-cloud-python and/or pandas seems to do the necessary type conversion automatically, even when there are nulls (you can confirm by examining the dtypes of the resulting dataframes).

    type: feature request 
    opened by jasonqng 55
  • Performance

    Performance

    We're starting to use BigQuery heavily but are increasingly bottlenecked by the performance of moving moderate amounts of data from BigQuery to Python.

    Here are a few stats:

    • 29.1s: Pulling 500k rows with 3 columns of data (with cached data) using pandas-gbq
    • 36.5s: Pulling the same query with google-cloud-bigquery - i.e. client.query(query).to_dataframe()
    • 2.4s: Pulling very similar data - same types, same size, from our existing MSSQL box hosted in AWS (using pd.read_sql). That's on standard drivers, nothing like turbodbc involved

    ...so using BigQuery with Python is at least an order of magnitude slower than traditional DBs.

    We've tried exporting tables to CSV on GCS and reading those in, which works fairly well for data processes, though not for exploration.

    A few questions - feel free to jump in with partial replies:

    • Are these results expected, or are we doing something very wrong?
    • My prior is that a lot of this slowdown is caused by pulling in paginated HTTP responses, converting them to Python objects, and then writing those into arrays. Is this approach really scalable? Should pandas-gbq invest resources in getting a format that's queryable in exploratory workflows and can handle more reasonably sized datasets (or at least encourage Google to)? One possible mitigation is sketched below.
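    One mitigation in later pandas-gbq releases is fetching results via the BigQuery Storage API, which streams Arrow/Avro blocks instead of paginated JSON over HTTP. A sketch, assuming a pandas-gbq version that still exposes the use_bqstorage_api flag and that google-cloud-bigquery-storage is installed (recent releases use the Storage API automatically and have deprecated the flag); the table name is a placeholder:

    import pandas_gbq

    df = pandas_gbq.read_gbq(
        "SELECT * FROM `my-project.my_dataset.my_table`",  # placeholder table
        project_id="my-project",
        use_bqstorage_api=True,  # requires google-cloud-bigquery-storage
    )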
    opened by max-sixty 45
  • BUG: oauth2client deprecated, use google-auth instead.

    BUG: oauth2client deprecated, use google-auth instead.

    Remove the use of oauth2client and use the google-auth library instead.

    Rather than check for multiple versions of the libraries, use the setup.py to specify compatible versions. I believe this is safe since Pandas checks for the pandas_gbq package.

    Since google-auth does not use the argparse module to override user authentication flow settings, add a parameter to choose between the web and console flow.

    Closes https://github.com/pydata/pandas-gbq/issues/37.
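    For reference, the google-auth path in current pandas-gbq lets you build scoped credentials explicitly and hand them in; a sketch (the key path and project ID are placeholders):

    import pandas_gbq
    from google.oauth2 import service_account

    # Build credentials with the BigQuery scope attached up front.
    credentials = service_account.Credentials.from_service_account_file(
        "/path/to/key.json",  # placeholder path
        scopes=["https://www.googleapis.com/auth/bigquery"],
    )
    df = pandas_gbq.read_gbq("SELECT 1", project_id="my-project", credentials=credentials)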

    type: bug 
    opened by tswast 41
  • to_gbq result in UnicodeEncodeError

    to_gbq result in UnicodeEncodeError

    Hi, I'm using Heroku to run a python based ETL process where I'm pushing the contents of a Pandas dataframe into Google BQ using to_gbq. However, it's generating a UnicodeEncodeError with the following stack trace, due to some non-latin characters.

    Strangely, this works fine on my Mac, but it fails when I run it on Heroku. It seems that http.client is getting an unencoded string rather than bytes, so it tries to encode the body with Latin-1 (the default), which obviously chokes on anything non-Latin, like Chinese characters.
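    The encoding failure itself is easy to reproduce in isolation, since Latin-1 cannot represent these characters at all:

    >>> "信用卡".encode("latin-1")
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)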

    Load is 100.0% Complete
    Traceback (most recent call last):
      File "AllCostAndRev.py", line 534, in <module>
        main(yaml.dump(data=ads_dict))
      File "AllCostAndRev.py", line 475, in main
        private_key=environ['skynet_bq_pk']
      File "/app/.heroku/python/lib/python3.6/site-packages/pandas_gbq/gbq.py", line 989, in to_gbq
        connector.load_data(dataframe, dataset_id, table_id, chunksize)
      File "/app/.heroku/python/lib/python3.6/site-packages/pandas_gbq/gbq.py", line 590, in load_data
        job_config=job_config).result()
      File "/app/.heroku/python/lib/python3.6/site-packages/google/cloud/bigquery/client.py", line 748, in load_table_from_file
        file_obj, job_resource, num_retries)
      File "/app/.heroku/python/lib/python3.6/site-packages/google/cloud/bigquery/client.py", line 777, in _do_resumable_upload
        response = upload.transmit_next_chunk(transport)
      File "/app/.heroku/python/lib/python3.6/site-packages/google/resumable_media/requests/upload.py", line 395, in transmit_next_chunk
        retry_strategy=self._retry_strategy)
      File "/app/.heroku/python/lib/python3.6/site-packages/google/resumable_media/requests/_helpers.py", line 101, in http_request
        func, RequestsMixin._get_status_code, retry_strategy)
      File "/app/.heroku/python/lib/python3.6/site-packages/google/resumable_media/_helpers.py", line 146, in wait_and_retry
        response = func()
      File "/app/.heroku/python/lib/python3.6/site-packages/google/auth/transport/requests.py", line 186, in request
        method, url, data=data, headers=request_headers, **kwargs)
      File "/app/.heroku/python/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
        resp = self.send(prep, **send_kwargs)
      File "/app/.heroku/python/lib/python3.6/site-packages/requests/sessions.py", line 618, in send
        r = adapter.send(request, **kwargs)
      File "/app/.heroku/python/lib/python3.6/site-packages/requests/adapters.py", line 440, in send
        timeout=timeout
      File "/app/.heroku/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
        chunked=chunked)
      File "/app/.heroku/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 357, in _make_request
        conn.request(method, url, **httplib_request_kw)
      File "/app/.heroku/python/lib/python3.6/http/client.py", line 1239, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "/app/.heroku/python/lib/python3.6/http/client.py", line 1284, in _send_request
        body = _encode(body, 'body')
      File "/app/.heroku/python/lib/python3.6/http/client.py", line 161, in _encode
        (name.title(), data[err.start:err.end], name)) from None
    UnicodeEncodeError: 'latin-1' codec can't encode characters in position 553626-553628: Body ('信用卡') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.
    Process exited with status 1

    opened by 2legit 24
  • Import error with pandas_gbq

    Import error with pandas_gbq

    There's a bug with the most recent google-cloud-bigquery library, which causes this error in pandas-gbq:

    ImportError: pandas-gbq requires google-cloud-bigquery: cannot import name 'collections_abc'

    opened by winsonhys 23
  • When appending to a table, load if the dataframe contains a subset of the existing schema

    When appending to a table, load if the dataframe contains a subset of the existing schema

    Purpose

    The current behavior of to_gbq is to fail if the schema of the new data is not equivalent to the current schema. However, this means the load fails if the new data is missing columns that are present in the current schema. For instance, this may occur when the data source I am using to construct the dataframe only provides non-empty values. Rather than determining the current schema of the GBQ table and adding empty columns to my dataframe, I would like to_gbq to load my data if the columns in the dataframe are a subset of the current schema.

    Primary changes made

    • Factoring a schema function out of verify_schema to support both verify_schema and schema_is_subset
    • schema_is_subset determines whether local_schema is a subset of remote_schema
    • the append flag uses schema_is_subset rather than verify_schema to determine if the data can be loaded
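    Roughly, the new check reduces to a field-by-field containment test; a sketch (field dicts follow the BigQuery schema format with name/type keys, and the actual implementation in the PR may differ):

    def schema_is_subset(schema_remote, schema_local):
        # Every field in the local (DataFrame) schema must also appear,
        # with the same name and type, in the remote (table) schema.
        fields_remote = schema_remote["fields"]
        fields_local = schema_local["fields"]
        return all(field in fields_remote for field in fields_local)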

    Auxiliary changes made

    • PROJECT_ID etc are retrieved from an environment variable to facilitate local testing
    • Running test_gbq through autopep8 added a blank line after two class names
    opened by mr-mcox 19
  • Set project_id (and other settings) once for all subsequent queries so you don't have to pass every time

    Set project_id (and other settings) once for all subsequent queries so you don't have to pass every time

    One frustrating thing is having to pass the project_id (among other parameters) every time you write a query. For example, I personally usually use the same project_id, almost always query with standard SQL, and usually turn off verbose. I have to pass those three with every read_gbq call, and the typing adds up.

    Potential options include setting an environment variable and reading default settings from it, but the settings can differ from run to run, and fiddling with environment variables feels unfriendly. My suggestion would be to add a class that wraps read_gbq() and to_gbq() in a client object. You could set the project_id, dialect, and whatever else on the client object, then reuse the object every time you want to run a query with those settings.

    A very naive implementation here in this branch: https://github.com/pydata/pandas-gbq/compare/master...jasonqng:client-object-class?expand=1
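    The core of it amounts to storing the shared settings once and forwarding them on every call; a minimal, hypothetical sketch (not an actual pandas-gbq API):

    import pandas_gbq

    class Client:
        def __init__(self, project_id=None, dialect='legacy', verbose=True):
            self.project_id = project_id
            self.dialect = dialect
            self.verbose = verbose

        def read(self, query, **kwargs):
            # Per-call keyword arguments override the stored defaults.
            kwargs.setdefault('project_id', self.project_id)
            kwargs.setdefault('dialect', self.dialect)
            kwargs.setdefault('verbose', self.verbose)
            return pandas_gbq.read_gbq(query, **kwargs)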

    Usage would be like:

    >>> import gbq
    >>> client = gbq.Client(project_id='project-name',dialect='standard',verbose=False)
    >>> client.read("select 1")
       f0_
    0    1
    >>> client.read("select 2")
       f0_
    0    2
    >>> client.verbose=True
    >>> client.read("select 3")
    Requesting query... ok.
    Job ID: c7d7e4c0-883a-4e14-b35f-61c9fae0c08b
    Query running...
    Query done.
    Processed: 0.0 B Billed: 0.0 B
    Standard price: $0.00 USD
    
    Retrieving results...
    Got 1 rows.
    
    Total time taken 1.66 s.
    Finished at 2018-01-02 14:06:01.
       f0_
    0    3
    

    Does that seem like a reasonable solution to all this extra typing or is there another preferred way? If so, I can open up a PR with the above branch.

    Thanks, my tired fingers thank you all!

    @tswast @jreback @parthea @maxim-lian

    opened by jasonqng 17
  • Structs lack proper names as dicts and arrays get turned into array of dicts

    Structs lack proper names as dicts and arrays get turned into array of dicts

    Version 0.1.4

    This query returns an improperly named dict:

    q = """
    select struct(a,b) col
    from
    (SELECT 1 a, 2 b)
    """
    df = gbq.read_gbq(q, dialect='standard', verbose=False)
    

    image

    Compare with the result from BigQuery: image

    An array of items also gets turned into an array of dicts sometimes. For example:

    q = """
    select array_agg(a)
    from
    (select "1" a UNION ALL select "2" a)
    """
    gbq.read_gbq(q, dialect='standard', verbose=False, project_id='project')
    

    outputs: image

    Compare to BigQuery: image

    These issues may or may not be related?

    type: bug help wanted 
    opened by jasonqng 17
  • Printing rather than logging?

    Printing rather than logging?

    We're printing in addition to logging when querying from BigQuery. This makes controlling the output much harder, aside from being unidiomatic.

    Printing in white, logging in red:

    https://cloud.githubusercontent.com/assets/5635139/23176541/6028b884-f831-11e6-911a-48aa7741a4da.png
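    Once everything goes through logging alone, callers can control the output with the standard machinery; a sketch, assuming pandas-gbq emits its messages under the pandas_gbq logger namespace (the point of this issue being that print() output cannot be silenced this way):

    import logging

    # Silence pandas-gbq chatter, or route it wherever the application logs go.
    logging.getLogger("pandas_gbq").setLevel(logging.WARNING)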

    type: feature request 
    opened by max-sixty 17
  • BUG: Add bigquery scope for google credentials

    BUG: Add bigquery scope for google credentials

    BigQuery requires scoped credentials when loading application default credentials.

    Quick test code below; it returns an "invalid token" error. When the create_scoped() statement is uncommented, the code runs correctly without any error.

    # use google default application credentials; set in the shell before running Python:
    #   export GOOGLE_APPLICATION_CREDENTIALS=/PATH/TO/GOOGLE_DEFAULT_CREDENTIALS.json
    
    import httplib2
    
    from googleapiclient.discovery import build
    from oauth2client.client import GoogleCredentials
    
    credentials = GoogleCredentials.get_application_default()
    #credentials = credentials.create_scoped('https://www.googleapis.com/auth/bigquery')
    
    http = httplib2.Http()
    http = credentials.authorize(http)
    
    service = build('bigquery', 'v2', http=http)
    
    jobs = service.jobs()
    job_data = {'configuration': {'query': {'query': 'SELECT 1'}}}
    
    jobs.insert(projectId='projectid', body=job_data).execute()
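
    For comparison, with the newer google-auth library the scope can be requested up front when loading the default credentials; a sketch:

    import google.auth

    # google.auth.default() accepts scopes directly, so no separate
    # create_scoped() step is needed.
    credentials, project_id = google.auth.default(
        scopes=["https://www.googleapis.com/auth/bigquery"]
    )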
    
    type: bug 
    opened by xcompass 16
  • read_gbq() unnecessarily waiting on getting default credentials from Google

    read_gbq() unnecessarily waiting on getting default credentials from Google

    When attempting to grant pandas access to my GBQ project, I am running into an issue where read_gbq tries to get default credentials, fails / times out, and then prints out a URL to visit to grant credentials. Since I'm not running this on Google Cloud Platform, I do not expect to be able to get default credentials. In my case, I only want to run the CLI flow (without OAuth calling back to my local server).

    Here's the code

    >>> import pandas_gbq as gbq
    >>> gbq.read_gbq('SELECT 1', project_id=<project_id>, auth_local_webserver=False)
    

    Here's what I see when I trigger a SIGINT once the query is invoked:

      File "/usr/lib/python3.5/site-packages/pandas_gbq/gbq.py", line 214, in get_credentials
        credentials = self.get_application_default_credentials()
      File "/usr/lib/python3.5/site-packages/pandas_gbq/gbq.py", line 243, in get_application_default_credentials
        credentials, _ = google.auth.default(scopes=[self.scope])
      File "/usr/lib/python3.5/site-packages/google/auth/_default.py", line 277, in default
        credentials, project_id = checker()
      File "/usr/lib/python3.5/site-packages/google/auth/_default.py", line 274, in <lambda>
        lambda: _get_gce_credentials(request))
      File "/usr/lib/python3.5/site-packages/google/auth/_default.py", line 176, in _get_gce_credentials
        if _metadata.ping(request=request):
      File "/usr/lib/python3.5/site-packages/google/auth/compute_engine/_metadata.py", line 73, in ping
        timeout=timeout)
      File "/usr/lib/python3.5/site-packages/google/auth/transport/_http_client.py", line 103, in __call__
        method, path, body=body, headers=headers, **kwargs)
      File "/usr/lib/python3.5/http/client.py", line 1106, in request
        self._send_request(method, url, body, headers)
      File "/usr/lib/python3.5/http/client.py", line 1151, in _send_request
        self.endheaders(body)
      File "/usr/lib/python3.5/http/client.py", line 1102, in endheaders
        self._send_output(message_body)
      File "/usr/lib/python3.5/http/client.py", line 934, in _send_output
        self.send(msg)
      File "/usr/lib/python3.5/http/client.py", line 877, in send
        self.connect()
      File "/usr/lib/python3.5/http/client.py", line 849, in connect
        (self.host,self.port), self.timeout, self.source_address)
      File "/usr/lib/python3.5/socket.py", line 702, in create_connection
        sock.connect(sa)
    KeyboardInterrupt
    

    I've also tried setting the env variable GOOGLE_APPLICATIONS_CREDENTIALS to empty. I'm using pandas-gbq version at commit 64a19b.
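    With current pandas-gbq releases, one way to sidestep the metadata-server probe entirely is to obtain user credentials explicitly and pass them in (a credentials parameter that was not available when this issue was filed); a sketch, assuming pydata-google-auth is installed and the project ID is a placeholder:

    import pandas_gbq
    import pydata_google_auth

    # Runs the user OAuth flow directly instead of probing for default credentials.
    credentials = pydata_google_auth.get_user_credentials(
        ["https://www.googleapis.com/auth/bigquery"],
    )
    df = pandas_gbq.read_gbq("SELECT 1", project_id="my-project", credentials=credentials)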

    type: bug type: cleanup 
    opened by dfontenot 15
  • docs: fix reading dtypes

    docs: fix reading dtypes

    Hello. I faced the same confusion as #579, so I have tried to update the docs.

    Not only the BQ DATE type but also the TIME, TIMESTAMP, and FLOAT64 types seem to be wrong.

    It seems to be due to a breaking change in google-cloud-bigquery v3.
    https://cloud.google.com/python/docs/reference/bigquery/latest/upgrading#changes-to-data-types-loading-a-pandas-dataframe

    I have confirmed the correct dtypes as:

    >>> import pandas
    
    >>> sql1 = """
    SELECT
      TRUE AS BOOL,
      123 AS INT64,
      123.456 AS FLOAT64,
    
      TIME '12:30:00.45' AS TIME,
      DATE "2023-01-01" AS DATE,
      DATETIME "2023-01-01 12:30:00.45" AS DATETIME,
      TIMESTAMP "2023-01-01 12:30:00.45" AS TIMESTAMP
    """
    
    >>> pandas.read_gbq(sql1).dtypes
    BOOL                        boolean
    INT64                         Int64
    FLOAT64                     float64
    TIME                         dbtime
    DATE                         dbdate
    DATETIME             datetime64[ns]
    TIMESTAMP       datetime64[ns, UTC]
    dtype: object
    
    >>> sql2 = """
    SELECT
      DATE "2023-01-01" AS DATE,
      DATETIME "2023-01-01 12:30:00.45" AS DATETIME,
      TIMESTAMP "2023-01-01 12:30:00.45" AS TIMESTAMP,
    UNION ALL
    SELECT
      DATE "2263-04-12" AS DATE,
      DATETIME "2263-04-12 12:30:00.45" AS DATETIME,
      TIMESTAMP "2263-04-12 12:30:00.45" AS TIMESTAMP
    """
    
    >>> pandas.read_gbq(sql2).dtypes
    DATE         object
    DATETIME     object
    TIMESTAMP    object
    dtype: object
    

    Fixes #579 🦕

    api: bigquery size: s 
    opened by yokomotod 1
  • feat: adds ability to provide redirect uri

    feat: adds ability to provide redirect uri

    WIP PR for discussion: aiming to provide the ability to include a redirect URI, client ID, and client secrets to facilitate the migration away from "out of band" OAuth authentication.

    @tswast

    See also changes in these repos:

    • https://github.com/googleapis/python-bigquery-pandas/pull/595 #python-bigquery-pandas
    • https://github.com/googleapis/google-auth-library-python-oauthlib/pull/259
    • https://github.com/pydata/pydata-google-auth/pull/58
    api: bigquery size: m 
    opened by chalmerlowe 1
  • Problems installing the package on macOS M1 chip

    Problems installing the package on macOS M1 chip

    Hi,

    I am having problems installing this package on macOS with an M1 chip.

    The error:

    Could not find <Python.h>. This could mean the following:
      * You're on Ubuntu and haven't run apt-get install python3-dev.
      * You're on RHEL/Fedora and haven't run yum install python3-devel or dnf install python3-devel
        (make sure you also have redhat-rpm-config installed)
      * You're on Mac OS X and the usual Python framework was somehow corrupted
        (check your environment variables or try re-installing?)
      * You're on Windows and your Python installation was somehow corrupted
        (check your environment variables or try re-installing?)

      [end of output]
    

    note: This error originates from a subprocess, and is likely not a problem with pip. error: legacy-install-failure

    × Encountered error while trying to install package. ╰─> grpcio

    note: This is an issue with the package mentioned above, not pip. hint: See above for output from the failure.

    Environment details

    • OS type and version: MacOS 12.4 (chip M1)
    • Python version: Python 3.10.0
    • pip version: pip 22.3.1

    Thanks!

    api: bigquery 
    opened by davidgarciatwenix 0
  • Ability to handle a dry_run

    Ability to handle a dry_run

    Hi, after checking out the pandas_gbq.read_gbq call parametrization, I see that I can supply configuration={'dry_run': True} to make the query job a dry run.

    However, it will still attempt to find the query destination and download rows from it, which in this case will be nonexistent. It would be great if pandas_gbq were aware of dry_run and just output the query stats to the debug log or returned some stats data.

    e.g. querying something like this: pandas_gbq.read_gbq("SELECT * FROM 'my_project_id.billing_ds.cloud_pricing_export'", configuration={'dry_run': True})

    still results in the exception

    Traceback (most recent call last):
      File "big_query_utils.py", line 134, in <module>
        print(read_df("SELECT * FROM 'my_project_id.billing_ds.cloud_pricing_export'", configuration={'dry_run': True}))
      File "/Users/.../big_query/big_query_utils.py", line 95, in read_df
        return pandas_gbq.read_gbq(sql_or_table_id, **gbq_kwargs)
      File "/Users/.../lib/python3.9/site-packages/pandas_gbq/gbq.py", line 921, in read_gbq
        final_df = connector.run_query(
      File "/Users/.../lib/python3.9/site-packages/pandas_gbq/gbq.py", line 526, in run_query
        rows_iter = self.client.list_rows(
      File "/Users/.../lib/python3.9/site-packages/google/cloud/bigquery/client.py", line 3790, in list_rows
        table = self.get_table(table.reference, retry=retry, timeout=timeout)
      File "/Users/.../lib/python3.9/site-packages/google/cloud/bigquery/client.py", line 1034, in get_table
        api_response = self._call_api(
      File "/Users/.../lib/python3.9/site-packages/google/cloud/bigquery/client.py", line 782, in _call_api
        return call()
      File "/Users/.../lib/python3.9/site-packages/google/api_core/retry.py", line 283, in retry_wrapped_func
        return retry_target(
      File "/Users/.../lib/python3.9/site-packages/google/api_core/retry.py", line 190, in retry_target
        return target()
      File "/Users/.../lib/python3.9/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
        raise exceptions.from_http_response(response)
    google.api_core.exceptions.NotFound: 404 GET https://bigquery.googleapis.com/bigquery/v2/projects/my_project_id/datasets/_6a20f817b1e72d456384bdef157062be9989000e/tables/anon71d825e7efee2856ce2b5e50a3df3a2579fd5583d14740ca3064bab740c8ffd9?prettyPrint=false: Not found: Table my_project_id:_6a20f817b1e72d456384bdef157062be9989000e.anon71d825e7efee2856ce2b5e50a3df3a2579fd5583d14740ca3064bab740c8ffd9
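    Until pandas_gbq understands dry_run, the stats can be pulled by calling google-cloud-bigquery directly; a sketch:

    from google.cloud import bigquery

    client = bigquery.Client(project="my_project_id")
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(
        "SELECT * FROM `my_project_id.billing_ds.cloud_pricing_export`",
        job_config=job_config,
    )
    # A dry-run job never produces a destination table; it only reports stats.
    print(f"This query would process {job.total_bytes_processed} bytes.")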

    type: feature request api: bigquery 
    opened by ehborisov 0
  • NUMERIC Field failing with conversion from NoneType to Decimal is not supported

    NUMERIC Field failing with conversion from NoneType to Decimal is not supported

    Saving data to a NUMERIC field fails with "conversion from NoneType to Decimal is not supported".

    • python 3.9
    • pandas 1.5.1

    Stack trace

    
    ...........

      df.to_gbq(project_id=self.client.project,
    File "/Users/xxxx/.local/lib/python3.9/site-packages/pandas-1.5.1-py3.9-macosx-10.9-x86_64.egg/pandas/core/frame.py", line 2168, in to_gbq
      gbq.to_gbq(
    File "/Users/xxxx/.local/lib/python3.9/site-packages/pandas-1.5.1-py3.9-macosx-10.9-x86_64.egg/pandas/io/gbq.py", line 218, in to_gbq
      pandas_gbq.to_gbq(
    File "/Users/xxxx/.local/lib/python3.9/site-packages/pandas_gbq-0.17.9-py3.9.egg/pandas_gbq/gbq.py", line 1198, in to_gbq
      connector.load_data(
    File "/Users/xxxx/.local/lib/python3.9/site-packages/pandas_gbq-0.17.9-py3.9.egg/pandas_gbq/gbq.py", line 591, in load_data
      chunks = load.load_chunks(
    File "/Users/xxxx/.local/lib/python3.9/site-packages/pandas_gbq-0.17.9-py3.9.egg/pandas_gbq/load.py", line 240, in load_chunks
      load_parquet(
    File "/Users/xxxx/.local/lib/python3.9/site-packages/pandas_gbq-0.17.9-py3.9.egg/pandas_gbq/load.py", line 128, in load_parquet
      dataframe = cast_dataframe_for_parquet(dataframe, schema)
    File "/Users/xxxx/.local/lib/python3.9/site-packages/pandas_gbq-0.17.9-py3.9.egg/pandas_gbq/load.py", line 103, in cast_dataframe_for_parquet
      cast_column = dataframe[column_name].map(decimal.Decimal)
    File "/Users/xxxx/.local/lib/python3.9/site-packages/pandas-1.5.1-py3.9-macosx-10.9-x86_64.egg/pandas/core/series.py", line 4539, in map
      new_values = self._map_values(arg, na_action=na_action)
    File "/Users/xxxx/.local/lib/python3.9/site-packages/pandas-1.5.1-py3.9-macosx-10.9-x86_64.egg/pandas/core/base.py", line 890, in _map_values
      new_values = map_f(values, mapper)
    File "pandas/_libs/lib.pyx", line 2918, in pandas._libs.lib.map_infer
    TypeError: conversion from NoneType to Decimal is not supported
    
    api: bigquery 
    opened by ismailsimsek 1
  • ImportError: cannot import name 'external_account_authorized_user' from 'google.auth'

    ImportError: cannot import name 'external_account_authorized_user' from 'google.auth'

    Environment details

    • OS type and version: Linux
    • Python version: 3.9
    • pip version: 21.2
    • pandas-gbq version: 0.17.9

    Steps to reproduce

    1. Running a simple query using
    test = pd.read_gbq('select * from `data-production.dwh_core.transaction_code` limit 1', 
                       project_id='data-production', 
                       dialect='standard', 
                       location='asia-southeast2')
    

    Results in:

    ImportError: cannot import name 'external_account_authorized_user' from 'google.auth' (/opt/conda/lib/python3.8/site-packages/google/auth/__init__.py)
    

    Stack trace

    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-2-99714eec0d64> in <module>
    ----> 1 import pydata_google_auth
    
    /opt/conda/lib/python3.8/site-packages/pydata_google_auth/__init__.py in <module>
    ----> 1 from .auth import default
          2 from .auth import get_user_credentials
          3 from .auth import load_user_credentials
          4 from .auth import save_user_credentials
          5 from .auth import load_service_account_credentials
    
    /opt/conda/lib/python3.8/site-packages/pydata_google_auth/auth.py in <module>
          6 import google.auth.exceptions
          7 import google.oauth2.credentials
    ----> 8 from google_auth_oauthlib import flow
          9 import oauthlib.oauth2.rfc6749.errors
         10 import google.auth.transport.requests
    
    /opt/conda/lib/python3.8/site-packages/google_auth_oauthlib/__init__.py in <module>
         19 """
         20 
    ---> 21 from .interactive import get_user_credentials
         22 
         23 __all__ = ["get_user_credentials"]
    
    /opt/conda/lib/python3.8/site-packages/google_auth_oauthlib/interactive.py in <module>
         25 import socket
         26 
    ---> 27 import google_auth_oauthlib.flow
         28 
         29 
    
    /opt/conda/lib/python3.8/site-packages/google_auth_oauthlib/flow.py in <module>
         67 import google.oauth2.credentials
         68 
    ---> 69 import google_auth_oauthlib.helpers
         70 
         71 
    
    /opt/conda/lib/python3.8/site-packages/google_auth_oauthlib/helpers.py in <module>
         25 import json
         26 
    ---> 27 from google.auth import external_account_authorized_user
         28 import google.oauth2.credentials
         29 import requests_oauthlib
    
    ImportError: cannot import name 'external_account_authorized_user' from 'google.auth' (/opt/conda/lib/python3.8/site-packages/google/auth/__init__.py)
    

    Resolution

    We had to downgrade google-auth-oauthlib to 0.5.3:

     !pip install google-auth-oauthlib==0.5.3
    

    It seems like the most recent change (October 25) broke something: https://pypi.org/project/google-auth-oauthlib/#history

    api: bigquery 
    opened by benjamintanweihao 0
Releases: v0.18.1