How To Access And Query Your Google BigQuery Data Using Python And R
In this post, we see how to load Google BigQuery data using Python and R, followed by querying the data to get useful insights. We leverage the Google Cloud BigQuery library for connecting BigQuery with Python, and the bigrquery library to do the same with R.
We also look into the two steps of manipulating the BigQuery data using Python/R:
- Connecting to Google BigQuery and accessing the data
- Querying the data using Python/R
In this post, we assume that you have all your customer data stored in Google BigQuery.
If you are interested in learning more about how to get your data from your data sources into tools like Google BigQuery and other data warehouse solutions in real-time, you should explore Customer Data Infrastructure tools like RudderStack.
Python is one of the most widely-used general-purpose programming languages out there. It has gained a lot of attention and popularity because of its ease of use and flexibility.
Many engineers and data science teams also prefer Python because of the extensive libraries and tools available at their disposal to connect with other third-party systems to manipulate the data.
To query your Google BigQuery data using Python, we need to connect the Python client to our BigQuery instance. We do so using a cloud client library for the Google BigQuery API. You can also choose to use any other third-party option to connect BigQuery with Python; the BigQuery-Python library by tylertreat is also a great option.
We use the Google Cloud BigQuery library because it is stable and officially supported by Google.
For this post, we assume that you already have a Python development environment set up. If not, we highly recommend you refer to the Python Development Environment Setup Guide.
To install the library, run the following command from your terminal:
pip install --upgrade google-cloud-bigquery
Next, we connect the client to the database. To do this, you will need to download a JSON file that contains the BigQuery service account credentials. If you don't have a service account, follow this guide to create one, and then proceed to download the JSON file to your local machine.
Now that we have everything set up, we proceed to initialize the connection. The following Python code is used to do so:
from google.cloud import bigquery
from google.oauth2 import service_account

# Load the service account credentials from the downloaded JSON key file
credentials = service_account.Credentials.from_service_account_file(
    'path/to/file.json')

project_id = 'my-bq'
client = bigquery.Client(credentials=credentials, project=project_id)
In the snippet above, you will need to specify the project_id and the location of your JSON key file by replacing 'path/to/file.json' with the actual path to the locally stored JSON file.
In Google BigQuery, the project is a top-level container and provides default access control across all the datasets.
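To quickly verify that the connection works, you can list the datasets available in your project. Here is a minimal sketch, assuming your service account has access to at least one dataset in the project:

# List the datasets in the project to confirm the client is connected
datasets = list(client.list_datasets())
for dataset in datasets:
    print(dataset.dataset_id)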
Now that we have the BigQuery client set up and ready to use, we can execute queries on the BigQuery dataset.
For this, we use the query method, which inserts a query job into the BigQuery queue. The query is executed asynchronously: the method returns a QueryJob instance right away, and calling its result() method waits for the job to complete and returns the query results.
For more details on how this method works, please refer to the official documentation here.
The required Python code is as follows:
query_job = client.query("""
    SELECT *
    FROM dataset.my_table
    LIMIT 1000
""")

results = query_job.result()  # Wait for the job to complete.
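Once the job completes, you can iterate over the returned rows directly or, if you have pandas installed, load the results into a DataFrame. The sketch below assumes dataset.my_table has a hypothetical column named user_id; replace it with a column from your own table:

for row in results:
    # Each row can be accessed by column name; 'user_id' is a hypothetical column
    print(row["user_id"])

# Alternatively, load the results into a pandas DataFrame (requires pandas)
df = query_job.to_dataframe()
print(df.head())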
Please note that the above query uses standard SQL syntax by default. If you wish to use legacy SQL, use the code below:
job_config = bigquery.QueryJobConfig()
job_config.use_legacy_sql = True

query_job = client.query("""
    SELECT *
    FROM dataset.my_table
    LIMIT 1000
""", job_config=job_config)

results = query_job.result()  # Wait for the job to complete.
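A QueryJobConfig can also be used to pass query parameters, which is safer than building query strings by hand. Below is a minimal sketch that assumes dataset.my_table has a hypothetical country column; adjust the names and types for your own schema:

job_config = bigquery.QueryJobConfig()
job_config.query_parameters = [
    bigquery.ScalarQueryParameter("country", "STRING", "US"),  # hypothetical column and value
]

query_job = client.query("""
    SELECT *
    FROM dataset.my_table
    WHERE country = @country
    LIMIT 1000
""", job_config=job_config)

results = query_job.result()  # Wait for the job to complete.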
R is a popular alternative to Python used by many data scientists and engineers. If detailed, methodical statistical data analysis is your goal, then very few languages are as good as R.
When it comes to working with Google BigQuery, R too offers a robust and easy-to-use library for data manipulation and querying. For this post, we use the library bigrquery created and maintained by Hadley Wickham, a Chief Scientist at RStudio.
For this section, we assume that you have set up the R development environment already. If not, you can always follow this guide to set up RStudio.
To install bigrquery, we run the following command from within the R console:
install.packages("bigrquery")
And that's it! We are good to go.
As with Python, we will need to authorize our R client to access Google Cloud Services. As per the documentation for bigrquery, we will follow the prompt within the R console to open the authorization URL and copy the code into the console.
Note that this authorization needs to be done only once. The subsequent requests will automatically refresh the access credentials.
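If you prefer to authenticate non-interactively (for example, in a scheduled script), bigrquery also lets you authenticate with a service account key file via bq_auth. A minimal sketch, assuming you have downloaded the JSON key as described in the Python section:

library(bigrquery)

# Authenticate with a service account JSON key instead of the interactive flow
bq_auth(path = "path/to/file.json")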
To execute queries on the BigQuery data with R, we will follow these steps:
- Specify the project ID from the Google Cloud Console, as we did with Python.
- Form your query string to query the data.
- Call query_exec with your project ID and query string.
The code to implement this is as below:
# Import the library
library(bigrquery)

# Your project ID goes here
project_id <- "your-project-id"

# Sample query
sql_string <- "SELECT * FROM dataset.my_table LIMIT 1000"

# Execute the query and store the result
query_results <- query_exec(sql_string, project = project_id, useLegacySql = FALSE)
As with Python, if you wish to execute queries using legacy SQL, you can change useLegacySql to TRUE in your query_exec call.
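query_exec returns the results as a regular data frame, so you can inspect and manipulate them with the usual R tools. Note that in more recent releases of bigrquery, query_exec has been deprecated in favor of bq_project_query and bq_table_download; a minimal sketch of the newer approach is shown below:

# Inspect the results returned by query_exec
head(query_results)
nrow(query_results)

# Newer bigrquery API: run the query, then download the results as a data frame
tb <- bq_project_query(project_id, sql_string)
query_results <- bq_table_download(tb)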
In this post, we saw how easy and straightforward it is to access and manipulate the data stored in Google BigQuery using Python and R.
These two languages make it quite easy to build a statistical model on top of this data, which can be used for various purposes, such as understanding customers' in-app behavior or predicting churn rate.
There is a significant advantage to using a database to store your data compared to using other mediums such as CSV files. Apart from the flexibility to store large volumes of data with varying data types, you can leverage SQL's power to generate complex queries that give you meaningful insights.
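For example, a single SQL statement can aggregate millions of rows on the server before anything is pulled into Python or R. The sketch below reuses the Python client from earlier and assumes dataset.my_table contains hypothetical country and revenue columns:

query_job = client.query("""
    SELECT country, COUNT(*) AS users, SUM(revenue) AS total_revenue
    FROM dataset.my_table
    GROUP BY country
    ORDER BY total_revenue DESC
""")
results = query_job.result()  # Wait for the job to complete.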
While this post focuses on Google BigQuery, using any other database tool with R and Python is equally easy. The only difference will be the choice of Python/R library used to connect to the database.
Test out our event stream, ELT, and reverse-ETL pipelines. Use our HTTP source to send data in less than 5 minutes, or install one of our 12 SDKs in your website or app. Get started