How to use Google Bard in BigQuery (quickstart)

Google just released LLM support inside of BigQuery. Here's how to start using it today.

Seamus Abshere
Seamus Abshere on Jun 27, 2023

Google just announced you can use Bard in BigQuery to do Generative AI at scale. That means you can run your queries right inside of your data warehouse, where your data lives. No more rate limits, HTTP clients, error messages, etc.

Well, it's not exactly Bard... it's actually PaLM 2, called text-bison in Vertex AI land.

It works as "one shot" prompts - where you give it context inside of the prompt itself. It won't remember what you tell it, so it's safe to use on biz data. You can use Vertex AI API to fine-tune your model and I'm expecting this to be available via SQL soon.

First, create a US-based dataset: (or use one you already have... or another region that supports LLM)

bq mk --dataset --location us --project_id=$PROJECT_ID $DATASET_ID

Next, create a BigQuery connection: (there is no way to do this in SQL as far as I know)

$ bq mk --connection --connection_type=CLOUD_RESOURCE --location us --project_id=$PROJECT_NAME
Connection 78793318712.us.40833e7c-1c99-440f-911b-c553cf8485f1 successfully created

Next, see what service account is associated with that connection: (look at the end of the table)

$ bq show --connection 78793318712.us.40833e7c-1c99-440f-911b-c553cf8485f1
Connection 78793318712.us.40833e7c-1c99-440f-911b-c553cf8485f1

                          name                           friendlyName   description    Last modified         type        hasCredential                                            properties
 ------------------------------------------------------ -------------- ------------- ----------------- ---------------- --------------- -----------------------------------------------------------------------------------------------
  78793318712.us.40833e7c-1c99-440f-911b-c553cf8485f1                                25 Jun 17:35:01   CLOUD_RESOURCE   False           {"serviceAccountId": "bqcx-78793318712-abcd@gcp-sa-bigquery-condel.iam.gserviceaccount.com"}

Next, give the service account associated with that connection access to Vertex AI:

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member='serviceAccount:bqcx-78793318712-abcd@gcp-sa-bigquery-condel.iam.gserviceaccount.com' \
  --role='roles/aiplatform.user'

Next, create a remote model inside of BigQuery: (use the connection ID you got above, and note that I'm naming the model my_llm_1)

CREATE OR REPLACE MODEL
  `$PROJECT_ID.$DATASET_ID.my_llm_1`
REMOTE
WITH CONNECTION `projects/$PROJECT_ID/locations/us/connections/40833e7c-1c99-440f-911b-c553cf8485f1`
OPTIONS (remote_service_type ='cloud_ai_large_language_model_v1')

Then you should be able to run LLM queries:

SELECT
  *
FROM
  ML.GENERATE_TEXT(
    MODEL `$PROJECT_ID.$DATASET_ID.my_llm_1`,
    (
      SELECT
        "Say hello world" AS prompt
    ),
    STRUCT()
  )

But the killer app for LLM in the data warehouse is to run prompts across tables full of data. Here's how I tried sentiment analysis for some financial products:

CREATE TABLE `$PROJECT_ID.$DATASET_ID.my_outputs` AS (
  SELECT
    my_id,
    ml_generate_text_result["predictions"][0]["content"] AS sentiment_analysis,
  FROM
    ML.GENERATE_TEXT(
      MODEL `$PROJECT_ID.$DATASET_ID.my_llm_1`,
      (
        SELECT
          -- return my id so I can join this back to my data
          my_id,
          CONCAT(
            "Please provide an entity sentiment analysis of bonds, annuities, and CDs in the following text: ",
            text_content
          ) AS prompt,
        -- here's the magic: running a prompt for every row of `my_inputs` table
        FROM `$PROJECT_ID.$DATASET_ID.my_inputs`
      ),
      -- tell the LLM not to be super creative, but just return the highest probability results always
      STRUCT(
        0 AS temperature,
        1 AS top_k
      )
    )
)

Here's the final production query that uses some regex parsing to extract structured data from the LLM response:

CREATE TABLE `$PROJECT_ID.$DATASET_ID.my_outputs` AS (
  SELECT
    my_id,
    CASE
      -- if it gives back raw JSON, use it
      WHEN LEFT(JSON_VALUE(ml_generate_text_result["predictions"][0]["content"]), 1) = '{'
      THEN SAFE.PARSE_JSON(JSON_VALUE(ml_generate_text_result["predictions"][0]["content"]))
      -- if it gives back JSON inside a markdown block, use it
      ELSE SAFE.PARSE_JSON(REGEXP_EXTRACT(JSON_VALUE(ml_generate_text_result["predictions"][0]["content"]), r'```json([^`]+)```'))
    END AS sentiment_analysis,
  FROM
    ML.GENERATE_TEXT(
      MODEL `$PROJECT_ID.$DATASET_ID.my_llm_1`,
      (
        SELECT
          my_id,
          CONCAT(
            "Please provide, as a JSON object with only entity and sentiment, an entity sentiment analysis of bonds, annuities, and CDs in the following text: ",
            -- remove backticks that might interfere with parsing response
            REGEXP_REPLACE(text_content, r'`', "")
          ) AS prompt,
        FROM `$PROJECT_ID.$DATASET_ID.my_inputs`
      ),
      STRUCT(
        0 AS temperature,
        1 AS top_k
      )
    )
)

I'm hoping it will get easier to extract structured results soon, using jsonformer for example.

Thanks to the BigQuery ML team for implementing this. I think it's the obvious place for LLM!

Ready for easy AI?

Skip the ML struggle and focus on your downstream application. We have built-in sample data so you can get started without sharing yours.