Jun 26, 2025
How GoDaddy built a category generation system at scale with batch inference for Amazon Bedrock
This post was co-written with Vishal Singh, Data Engineering Leader on the Data & Analytics team at GoDaddy.
Generative AI solutions have the potential to transform businesses by boosting productivity and improving customer experiences, and using large language models (LLMs) in these solutions has become increasingly popular. However, inference of LLMs as single model invocations or API calls doesn’t scale well with many applications in production.
With batch inference, you can run multiple inference requests asynchronously to process a large number of requests efficiently. You can also use batch inference to improve the performance of model inference on large datasets.
This post provides an overview of a custom solution developed for GoDaddy, a domain registrar, registry, web hosting, and ecommerce company that seeks to make entrepreneurship more accessible by using generative AI to provide personalized business insights to over 21 million customers—insights that were previously only available to large corporations. In this collaboration, the Generative AI Innovation Center team created an accurate and cost-efficient generative AI–based solution using batch inference in Amazon Bedrock, helping GoDaddy improve their existing product categorization system.
GoDaddy wanted to enhance their product categorization system that assigns categories to products based on their names. For example:
GoDaddy used an out-of-the-box Meta Llama 2 model to generate product categories for six million products, where each product is identified by an SKU. The generated categories were often incomplete or mislabeled. Moreover, invoking an LLM separately for each product proved costly. GoDaddy therefore sought a more accurate and cost-efficient approach to product categorization to improve their customer experience.
This solution uses the following components to categorize products more accurately and efficiently:
The key steps are illustrated in the following figure:
The security measures are inherently integrated into the AWS services employed in this architecture. For detailed information, refer to the Security Best Practices section of this post.
We used a dataset that consisted of 30 labeled data points and 100,000 unlabeled test data points. The labeled data points were generated by llama2-7b and verified by a human subject matter expert (SME). As shown in the following screenshot of the sample ground truth, some fields have N/A or missing values, which isn’t ideal because GoDaddy wants a solution with high coverage for downstream predictive modeling. Higher coverage for each possible field can provide more business insights to their customers.
The distribution of the number of words or tokens per SKU shows only mild outliers, which makes it practical to bundle many products into a single prompt for categorization and potentially get a more efficient model response.
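As a rough illustration of bundling several products into one prompt, the following sketch groups product names into packs of n and writes one batch input record per pack. The helper names (pack, build_prompt, to_jsonl), the prompt wording, and the record ID scheme are hypothetical. The recordId and modelInput keys follow the Amazon Bedrock batch input format, and the prompt wrapper assumes an Anthropic Claude text-completion model.

```python
import json
from typing import Iterator, List


def pack(names: List[str], n: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most n product names."""
    if n < 1:
        raise ValueError("n must be at least 1")
    for start in range(0, len(names), n):
        yield names[start:start + n]


def build_prompt(group: List[str]) -> str:
    """Build one prompt that lists every product in the group (wording is illustrative)."""
    numbered = "\n".join(f"{i}. {name}" for i, name in enumerate(group, start=1))
    return (
        "\n\nHuman: Categorize each product below and answer in JSON.\n"
        f"{numbered}\n\nAssistant:"
    )


def to_jsonl(names: List[str], n: int, max_tokens: int = 2048) -> str:
    """Return JSONL text with one record per pack of n products."""
    lines = []
    for index, group in enumerate(pack(names, n)):
        record = {
            # Assumed ID scheme: "REC" plus a zero-padded counter.
            "recordId": f"REC{index:08d}",
            "modelInput": {
                "prompt": build_prompt(group),
                "max_tokens_to_sample": max_tokens,
            },
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

With 5-packing, 12 products would produce three records: two holding five products each and one holding the remaining two.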
The solution delivers a comprehensive framework for generating insights within GoDaddy’s product categorization system. It’s designed to be compatible with a range of LLMs on Amazon Bedrock, features customizable prompt templates, and supports batch and real-time (online) inferences. Additionally, the framework includes evaluation metrics that can be extended to accommodate changes in accuracy requirements.
In the following sections, we look at the key components of the solution in more detail.
We used Amazon Bedrock for batch inference processing. Amazon Bedrock provides the CreateModelInvocationJob API to create a batch job with a unique job name. This API returns a response containing jobArn. Refer to the following code:
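A minimal sketch of such a call with boto3 might look like the following. The helper names (build_job_request, submit_batch_job) are hypothetical, and the job name, role ARN, and S3 URIs are placeholders the caller must supply. The client is passed in so the sketch does not depend on AWS credentials being present.

```python
from typing import Any, Dict


def build_job_request(job_name: str, model_id: str, role_arn: str,
                      input_s3_uri: str, output_s3_uri: str) -> Dict[str, Any]:
    """Assemble keyword arguments for CreateModelInvocationJob."""
    return {
        "jobName": job_name,  # must be unique
        "modelId": model_id,
        "roleArn": role_arn,  # IAM role that Amazon Bedrock assumes to read and write S3
        "inputDataConfig": {"s3InputDataConfig": {"s3Uri": input_s3_uri}},
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": output_s3_uri}},
    }


def submit_batch_job(bedrock_client: Any, **job_settings: str) -> str:
    """Create the batch job and return its jobArn."""
    request = build_job_request(**job_settings)
    response = bedrock_client.create_model_invocation_job(**request)
    return response["jobArn"]


# Example usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   job_arn = submit_batch_job(
#       boto3.client("bedrock"),
#       job_name="my-batch-job",
#       model_id="anthropic.claude-v2",
#       role_arn="<role-arn>",
#       input_s3_uri="s3://<bucket>/input/",
#       output_s3_uri="s3://<bucket>/output/",
#   )
```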
We can monitor the job status using GetModelInvocationJob with the jobArn returned on job creation. The following are valid statuses during the lifecycle of a job:
The following is example code for the GetModelInvocationJob API:
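One way to monitor a job is to poll GetModelInvocationJob until it reaches a final status, sketched below. The function name wait_for_job, the polling interval, and the poll limit are assumptions. The set of terminal statuses uses only the final states named in this post; any other status the service returns is treated as still running until the poll limit is reached.

```python
import time
from typing import Any, Callable

# Final states named in this post; everything else counts as "still running".
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(bedrock_client: Any, job_arn: str,
                 poll_seconds: float = 60.0, max_polls: int = 1440,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll the job until it reaches a terminal status and return that status."""
    for _ in range(max_polls):
        response = bedrock_client.get_model_invocation_job(jobIdentifier=job_arn)
        status = response["status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_arn} did not finish after {max_polls} polls")
```

Passing sleep as a parameter keeps the loop easy to exercise without real waiting.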
When the job is complete, the S3 path specified in s3OutputDataConfig will contain a new folder with an alphanumeric name. The folder contains two files:
We then process the jsonl.out file in Amazon S3. This file is parsed using LangChain’s PydanticOutputParser to generate a .csv file. The PydanticOutputParser requires a schema to be able to parse the JSON generated by the LLM. We created a CCData class that contains the list of categories to be generated for each product as shown in the following code example. Because we enable n-packing, we wrap the schema with a List, as defined in List_of_CCData.
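The sketch below illustrates the same parsing step using only the Python standard library. It deliberately swaps LangChain's PydanticOutputParser and a Pydantic model for a plain dataclass and json.loads, so it is not the parser used in the solution. The field names and the list_of_dict wrapper come from the output schema shown later in this post. The "N/A" default and the completion and generation response keys (for Anthropic's Claude and Llama 2 text outputs, respectively) are assumptions.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class CCData:
    """Categories generated for one product (stand-in for the Pydantic class)."""
    product_name: str = "N/A"
    brand: str = "N/A"
    color: str = "N/A"
    material: str = "N/A"
    price: str = "N/A"
    category: str = "N/A"
    sub_category: str = "N/A"
    product_line: str = "N/A"
    gender: str = "N/A"
    year_of_first_sale: str = "N/A"
    season: str = "N/A"


def parse_completion(text: str) -> List[CCData]:
    """Parse one n-packed model response into a list of CCData rows."""
    payload = json.loads(text)  # strict: raises on malformed JSON
    known = {f.name for f in fields(CCData)}
    return [
        CCData(**{k: str(v) for k, v in item.items() if k in known})
        for item in payload["list_of_dict"]
    ]


def records_to_csv(jsonl_text: str) -> str:
    """Convert the contents of a .jsonl.out file into CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(CCData)])
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        output = json.loads(line).get("modelOutput", {})
        text = output.get("completion") or output.get("generation") or ""
        for row in parse_completion(text):
            writer.writerow(asdict(row))
    return buffer.getvalue()
```

Because each response holds several products, one input line can yield several CSV rows. Unknown keys in the model output are dropped, and fields the model omits fall back to the default.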
We also use OutputFixingParser to handle situations where the initial parsing attempt fails. The following screenshot shows a sample generated .csv file.
Prompt engineering involves the skillful crafting and refining of input prompts. This process entails choosing the right words, phrases, sentences, punctuation, and separator characters to efficiently use LLMs for diverse applications. Essentially, prompt engineering is about effectively interacting with an LLM. The most effective prompt engineering strategy varies with the task and the data; here, the task is data card generation and the data is GoDaddy SKUs.
Prompts consist of particular inputs from the user that direct LLMs to produce a suitable response or output based on a specified task or instruction. These prompts include several elements, such as the task or instruction itself, the surrounding context, full examples, and the input text that guides LLMs in crafting their responses. The composition of the prompt will vary based on factors like the specific use case, data availability, and the nature of the task at hand. For example, in a Retrieval Augmented Generation (RAG) use case, we provide additional context and add a user-supplied query in the prompt that asks the LLM to focus on contexts that can answer the query. In a metadata generation use case, we can provide the image and ask the LLM to generate a description and keywords describing the image in a specific format.
In this post, we divide the prompt engineering solutions into two steps: output generation and format parsing.
The following are best practices and considerations for output generation:
The following are best practices and considerations for format parsing:
You are a Product Information Manager, Taxonomist, and Categorization Expert who follows instruction well.
EVERY category information needs to be filled based on BOTH product name AND your best guess. If you forget to generate any category information, leave it as missing or N/A, then an innocent people will die.
Format your output in the JSON format (ensure to escape special character): The output should be formatted as a JSON instance that conforms to the JSON schema below. As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]} the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
Here is the output schema:
{"properties": {"list_of_dict": {"title": "List Of Dict", "type": "array", "items": {"$ref": "#/definitions/CCData"}}}, "required": ["list_of_dict"], "definitions": {"CCData": {"title": "CCData", "type": "object", "properties": {"product_name": {"title": "Product Name", "description": "product name, which will be given as input", "type": "string"}, "brand": {"title": "Brand", "description": "Brand of the product inferred from the product name", "type": "string"}, "color": {"title": "Color", "description": "Color of the product inferred from the product name", "type": "string"}, "material": {"title": "Material", "description": "Material of the product inferred from the product name", "type": "string"}, "price": {"title": "Price", "description": "Price of the product inferred from the product name", "type": "string"}, "category": {"title": "Category", "description": "Category of the product inferred from the product name", "type": "string"}, "sub_category": {"title": "Sub Category", "description": "Sub-category of the product inferred from the product name", "type": "string"}, "product_line": {"title": "Product Line", "description": "Product Line of the product inferred from the product name", "type": "string"}, "gender": {"title": "Gender", "description": "Gender of the product inferred from the product name", "type": "string"}, "year_of_first_sale": {"title": "Year Of First Sale", "description": "Year of first sale of the product inferred from the product name", "type": "string"}, "season": {"title": "Season", "description": "Season of the product inferred from the product name", "type": "string"}}}}}
We used the following prompting parameters:
For Llama 2, the model choices were meta.llama2-13b-chat-v1 or meta.llama2-70b-chat-v1. We used the following LLM parameters:
For Anthropic’s Claude, the model choices were anthropic.claude-instant-v1 and anthropic.claude-v2. We used the following LLM parameters:
The solution is straightforward to extend to other LLMs hosted on Amazon Bedrock, such as Amazon Titan (switch the model ID to amazon.titan-tg1-large, for example), Jurassic (model ID ai21.j2-ultra), and more.
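Switching models mostly means switching the model ID and the shape of the request body, because each model family on Amazon Bedrock expects different parameter names. The following sketch shows one way to isolate that difference. The function name build_model_input and the default values for max_tokens, temperature, and top_p are placeholders, not the settings used in this solution.

```python
from typing import Any, Dict


def build_model_input(model_id: str, prompt: str, max_tokens: int = 2048,
                      temperature: float = 0.0, top_p: float = 0.9) -> Dict[str, Any]:
    """Return a request body in the shape the given model family expects."""
    if model_id.startswith("meta.llama2"):
        return {
            "prompt": prompt,
            "max_gen_len": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
        }
    if model_id.startswith("anthropic.claude"):
        # Claude text-completion models expect Human/Assistant turn markers.
        return {
            "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
            "max_tokens_to_sample": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
        }
    if model_id.startswith("amazon.titan"):
        return {
            "inputText": prompt,
            "textGenerationConfig": {
                "maxTokenCount": max_tokens,
                "temperature": temperature,
                "topP": top_p,
            },
        }
    raise ValueError(f"No request template defined for model ID {model_id!r}")
```

Supporting another family, such as Jurassic, would mean adding one more branch with that family's parameter names.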
The framework includes evaluation metrics that can be extended further to accommodate changes in accuracy requirements. Currently, it involves five different metrics:
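To make the idea of programmatic coverage metrics concrete, the following sketch computes three of them under assumed definitions, because the exact formulas are not given here. It treats content coverage as the share of category fields holding a non-missing value, recall on product name as the share of input names that reappear in the parsed output, and precision on product name as the share of parsed rows whose name matches an input. The function names and the set of strings counted as missing are hypothetical.

```python
from typing import Dict, Iterable, List, Optional

# Assumed set of values that count as "not filled in".
MISSING_VALUES = {"", "n/a", "na", "none", "missing", "unknown"}


def is_filled(value: Optional[str]) -> bool:
    """True when the value is present and not a missing-value placeholder."""
    return value is not None and str(value).strip().lower() not in MISSING_VALUES


def content_coverage(rows: List[Dict[str, str]], category_fields: Iterable[str]) -> float:
    """Share of (row, field) cells that hold a real value."""
    field_list = list(category_fields)
    total = len(rows) * len(field_list)
    if total == 0:
        return 0.0
    filled = sum(is_filled(row.get(name)) for row in rows for name in field_list)
    return filled / total


def name_recall(input_names: List[str], rows: List[Dict[str, str]]) -> float:
    """Share of input product names that appear in the parsed output."""
    if not input_names:
        return 0.0
    parsed = {row.get("product_name") for row in rows}
    return sum(name in parsed for name in input_names) / len(input_names)


def name_precision(input_names: List[str], rows: List[Dict[str, str]]) -> float:
    """Share of parsed rows whose product name matches an input name."""
    if not rows:
        return 0.0
    expected = set(input_names)
    return sum(row.get("product_name") in expected for row in rows) / len(rows)
```

Low recall on product name would suggest the model dropped products from a pack, while low precision would suggest it altered or invented product names.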
The following are the approximate sample input and output lengths under some best performing settings:
The following table summarizes our consolidated quantitative results.
The following tables summarize the scaling effect in batch inference.
The following table summarizes the effect of n-packing. Llama 2 has an output length limit of 2,048 tokens, which fits up to around 20 packed products per prompt. Anthropic’s Claude has a higher limit. We tested on 20 ground truth samples with 1-, 5-, and 10-packing and selected results across all models and prompt templates. The scaling effect on latency was more pronounced in the Anthropic’s Claude model family than in Llama 2. Anthropic’s Claude also generalized better than Llama 2 as the number of packed products in the output increased.
We tried few-shot prompting only with Llama 2 models, where it improved accuracy over zero-shot prompting.
We noted the following qualitative results:
We had the following key business takeaways:
We had the following key technical takeaways:
The following are the recommendations that the GoDaddy team is considering as a part of future steps:
In this post, we shared how the Generative AI Innovation Center team worked with GoDaddy to create a more accurate and cost-efficient generative AI-based solution using batch inference in Amazon Bedrock, helping GoDaddy improve their existing product categorization system. We implemented n-packing techniques and used Anthropic’s Claude and Meta Llama 2 models to improve latency. We experimented with different prompts to improve the categorization with LLMs and found that the Anthropic’s Claude model family gave better accuracy and generalizability than the Llama 2 model family. The GoDaddy team will test this solution on a larger dataset and evaluate the categories generated from the recommended approaches.
If you’re interested in working with the AWS Generative AI Innovation Center, please reach out.
Vishal Singh is a Data Engineering leader on the Data and Analytics team at GoDaddy. His key focus area is building data products and generating insights from them by applying data engineering tools along with generative AI.
Yun Zhou is an Applied Scientist at AWS where he helps with research and development to ensure the success of AWS customers. He works on pioneering solutions for various industries using statistical modeling and machine learning techniques. His interests include generative models and sequential data modeling.
Meghana Ashok is a Machine Learning Engineer at the Generative AI Innovation Center. She collaborates closely with customers, guiding them in developing secure, cost-efficient, and resilient solutions and infrastructure tailored to their generative AI needs.
Karan Sindwani is an Applied Scientist at AWS where he works with AWS customers across different verticals to accelerate their use of Gen AI and AWS Cloud services to solve their business challenges.
Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he uses his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.
Batch job statuses: Submitted; InProgress; Failed; Completed; Stopped
Status values listed separately: InProgress; Succeeded; Failed
Output files: json.out; <file_name>.jsonl.out
Output generation practices: Provide simple, clear and complete instructions; Use separator characters consistently; Deal with default output values such as missing; Use few-shot prompting; Use packing techniques; Test for good generalization; Use additional techniques for Anthropic’s Claude model families; Use additional techniques for Llama model families
Format parsing practices: Refine the prompt with modifiers; Role assumption; Prompt specificity; Output format description; Pay attention to few-shot example formatting; Use additional techniques for Anthropic’s Claude model families; Use additional techniques for Llama 2 model families
Prompting parameters: Number of packings; Number of in-context examples; Format instruction
Evaluation metrics: Content coverage; Parsing coverage; Parsing recall on product name; Parsing precision on product name; Final coverage; Human evaluation
Sample length entries: Input length for Llama 2 model family; Input length for Anthropic’s Claude model family; Output length with 5-packing
Results table column groups: Config; Latency; Accuracy
Results table columns: Batch process service; Model; Prompt; Batch process latency (5 packing); Near-real-time process latency (1 packing); Programmatic evaluation (coverage)
Results table values, in source order: Claude-v1 (instant); zero-shot (template6); 29s; 44.8/20=2.24s; 96.80%; 644s; 99.40%
Further table headers, in source order: Batch process service; Model; Prompt; Batch process latency (5 packing); Near-real-time process latency (1 packing); Batch process service; Near-real-time process latency (1 packing); Programmatic evaluation (coverage); Batch process service; Model; Prompt; Latency; Accuracy; Human evaluation
Remaining headings, in source order: Learnings; Llama2; Anthropic’s Claude; Improved latency; More cost-effectiveness; Enhanced accuracy; Qualitative assessment; Dataset enhancement; Human evaluation; Fine-tuning; Prompt engineering; Few-shot learning; Knowledge integration
