Google Typeahead System Design
Design Google Typeahead System (Autocomplete Suggestions).
Problem Statement:
Return the top 5 suggestions for the characters typed in the search bar. The suggestions returned should be the most recently and most frequently searched words starting with the typed prefix.
Example:
Search Prefix: CA
Stored words starting with CA:
CAT → 20K
CAMEL → 30K
CAR → 50K
CABLE → 10K
CATERPILLAR → 5K
CABS → 100K
CATASTROPHIC → 75K
CAMERA → 40K
Returned Results:
CABS
CATASTROPHIC
CAR
CAMERA
CAMEL
Note: Time to be spent on design is 45 mins to 1 hour.
Design is divided into 4 sections:
- Requirements
- Estimations
- Data Modelling and Detailed Design
- Monitoring, Alerting and Backups.
Requirements Gathering
Start your design by listing the functional and non-functional requirements. Spend about 5 to 7 mins coming up with the requirements.
Functional Requirements → Define the key features of the product. Usually defined by product managers.
Non-Functional Requirements → Define the quality attributes of the product. Engineers are responsible for coming up with the non-functional requirements. Incorrect non-functional requirements can hamper the end-to-end experience of the users.
Functional Requirements:
- Return top 5 suggestions based on the prefix.
- The frequency of the searched word should be updated in the system.
- Suggestions should be refreshed as the user types, roughly every 500 ms.
- To start with, consider only words and not sentences.
- Support for multiple languages.
- Customized Suggestions (Out of Scope)
- Region specific searches (Out of Scope)
Non Functional Requirements:
- Latency → Low latency, possibly as low as 50 ms. Users should get a near real-time experience.
- Availability → High availability, close to 99.999%.
- Consistency → Eventual Consistency should be fine.
Note: As per the CAP Theorem, in a distributed system you can have either Consistency or Availability along with Partition Tolerance.
In this design, availability is preferred over consistency. It is fine if the returned suggestions are not real time and only eventually consistent, but the system should be highly available.
Estimations
Spend about 5–7 mins to come up with the estimations for the design.
Usually, estimations have two parts:
- Capacity Planning → Storage required to store the data for the next 1–3 years.
- QPS (Queries Per Second) Planning → Number of read/write queries the system needs to handle per second.
Assumptions:
- 1 Billion Search Queries per day
- The average size of a word is 7–8 characters.
- 500M words are already present in the dictionary.
- 100K new words are added to the dictionary per day.
QPS Planning:
- Read QPS → For a search query, results are fetched every 500 ms and the average word is 7–8 characters, so about 3 read calls are made to the server for each search query, assuming the user types around 2–3 characters every 500 ms.
Reads per day = 1Billion * 3 = 3 * 10⁹
Read per second →
3 * 10⁹ / (24 * 60 * 60) = 3 * 10⁹ / 86400 ~= 3 * 10⁹ / 10⁵ ~= 3 * 10⁴
Load won’t be evenly distributed throughout the day, so considering a multiplier factor of 1.5.
Read QPS = 3 * 10⁴ * 1.5 = 45000
- Write QPS → For a search query, data is written into the system once the user selects a suggested word or presses enter. Assuming a read:write ratio of 3:1 (for each search query there are 3 reads and only 1 write):
Write QPS = 45000 / 3 = 15000
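The same arithmetic as a quick Java sanity check (purely illustrative; the class name and output format are not part of the original design):

public class QpsEstimate {
    public static void main(String[] args) {
        double readsPerDay = 1e9 * 3;               // 1B queries/day, ~3 prefix reads per query
        double readQps = readsPerDay / 1e5 * 1.5;   // ~10^5 seconds/day, 1.5x peak multiplier
        double writeQps = readQps / 3;              // read:write = 3:1, one write per query
        System.out.printf("Read QPS ~ %.0f, Write QPS ~ %.0f%n", readQps, writeQps);
        // Prints: Read QPS ~ 45000, Write QPS ~ 15000
    }
}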
Capacity Planning:
Assume all the data stored in the system is words and their frequencies. Capacity might change as the design evolves; once the design is complete, you can revisit the data you need to store.
Note: Currently, the capacity planning is done without considering any replication of data.
Capacity in bytes:
Considering 500M words already present in the dictionary and 100K new words generated daily.
Each character is stored as Unicode, so the memory used to store each character is 2 bytes.
Capacity = (500M words) * (50 bytes metadata per word + 2 bytes * 8 characters) + (100K new words) * (365 days) * (50 bytes metadata per word + 2 bytes * 8 characters)
= ~500M * 70 bytes + 100K * ~400 * 70 bytes
= 35 * 10⁹ + ~2.8 * 10⁹
= ~38 * 10⁹ bytes
= 38 * 10⁹ / (1000 * 1000 * 1000) GB
= ~38 GB (existing dictionary plus the first year of new words)
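And the capacity math as a similar sketch, reusing the rounded ~70 bytes per word and ~400 days from above (class name is illustrative):

public class CapacityEstimate {
    public static void main(String[] args) {
        double bytesPerWord = 70;                   // ~50 bytes metadata + 8 chars * 2 bytes, rounded up
        double existingWords = 500e6;               // 500M words already in the dictionary
        double newWordsPerYear = 100e3 * 400;       // 100K new words/day, ~400 days (rounded)
        double totalGb = (existingWords + newWordsPerYear) * bytesPerWord / 1e9;
        System.out.printf("~%.0f GB%n", totalGb);   // Prints: ~38 GB
    }
}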
Detailed Design
Spend around 30–35 mins on this section.
API Design:
- Read API → User is waiting for the result
Get Request
getSuggestions(char[] prefix) returns Top 5 Suggestions
Note: The system is read-heavy and should be optimized to return results as fast as possible (return the top 5 suggestions).
- Update API → User is not waiting for the result
Post Request
updateFreqForWord(char[] word)
Writes can be slower because users don't really care about the frequency increment in the table. → Preference is given to reads over writes.
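A minimal sketch of these two APIs as a Java interface (SuggestionService and Suggestion are hypothetical names, not part of the original design):

import java.util.List;

public interface SuggestionService {

    // Read API (GET): returns the top 5 suggestions for the typed prefix.
    // The user is waiting on this call, so it should be served from precomputed data.
    List<Suggestion> getSuggestions(String prefix);

    // Update API (POST): bumps the frequency of the word the user finally searched.
    // The user is not waiting, so this can be processed asynchronously.
    void updateFreqForWord(String word);
}

// A suggestion is just a word plus its aggregated search frequency.
record Suggestion(String word, long frequency) {}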
High Level Design:
Reads should be really fast. Writes can be asynchronous as the user is not waiting for the result of a write.
To make reads really fast, we can pre-compute the results and store the top 5 suggestions for each prefix in the DB, and use an in-memory or distributed cache for the results of common prefixes.
Writes can be recorded separately; whenever the threshold value for a particular word is crossed, the table storing the results for each prefix is updated.
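A rough sketch of that read path, assuming a cache sits in front of the precomputed Suggestion Table (SuggestionCache and SuggestionTable are hypothetical interfaces, not components named in the original design):

import java.util.List;
import java.util.Optional;

public class ReadPath {
    private final SuggestionCache cache;   // hot prefixes, e.g. a trie or LRU cache
    private final SuggestionTable table;   // precomputed top-5 list per prefix

    public ReadPath(SuggestionCache cache, SuggestionTable table) {
        this.cache = cache;
        this.table = table;
    }

    public List<String> getSuggestions(String prefix) {
        // 1. Serve common prefixes straight from the cache.
        Optional<List<String>> cached = cache.lookup(prefix);
        if (cached.isPresent()) {
            return cached.get();
        }
        // 2. Otherwise read the precomputed top 5 from the Suggestion Table
        //    (no ranking happens on the request path).
        return table.topFive(prefix);
    }
}

interface SuggestionCache { Optional<List<String>> lookup(String prefix); }
interface SuggestionTable { List<String> topFive(String prefix); }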
- LB represents Load Balancer, responsible for balancing the load between different instances of the same service/DB instances containing the same data.
- Web Servers are responsible for orchestrating the calls to different backend services. Auth is also triggered from the web servers (not required for the current design).
- Backend Servers → Using a microservice architecture where each service contains the APIs related to specific entities. Application servers usually contain the business logic and are responsible for calling the DBs to store/fetch data.
- Tables are sharded and replicated.
Tables:
Going ahead with a NoSQL DB as it is easier to shard data across multiple machines. We don't have any specific structure for the data, consistency is not a strict requirement, and we need high availability along with partition tolerance, which a NoSQL DB supports out of the box.
Note: To choose between a SQL and a NoSQL DB, the following parameters are considered:
- CAP Theorem
- Whether ACID properties are required while storing the data (prefer a SQL DB in such cases).
- Whether the data is structured (well-defined columns).
- Table 1 (Suggestion Table)
String prefix ← Key (Pk)
List<Pair<Freq, String>> top5Suggestions ← Value
- Table 2 (WordFrequency Table)
String word ← Key (Pk)
Int64 Freq ← Value
Data is read from Table 1 and writes are performed on Table 2. Whenever a word in Table 2 crosses a threshold frequency, an offline job updates Table 1 for all prefixes of the word along with the new frequency.
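A rough sketch of that offline job, with in-memory maps standing in for the two tables (all names here are illustrative; a real job would read and write the sharded DB instead):

import java.util.*;

public class PrefixRefreshJob {

    // Stand-ins for Table 2 (word -> frequency) and Table 1 (prefix -> top-5 list).
    private final Map<String, Long> wordFrequency = new HashMap<>();
    private final Map<String, List<Map.Entry<String, Long>>> suggestionTable = new HashMap<>();

    // Called when a word's frequency may have crossed the threshold: update every prefix of the word.
    public void refresh(String word, long threshold) {
        long freq = wordFrequency.getOrDefault(word, 0L);
        if (freq < threshold) {
            return;                                       // below threshold, nothing to do
        }
        for (int len = 1; len <= word.length(); len++) {  // "car" -> "c", "ca", "car"
            String prefix = word.substring(0, len);
            List<Map.Entry<String, Long>> top =
                suggestionTable.computeIfAbsent(prefix, p -> new ArrayList<>());
            top.removeIf(e -> e.getKey().equals(word));   // drop any stale entry for this word
            top.add(Map.entry(word, freq));
            top.sort((a, b) -> Long.compare(b.getValue(), a.getValue())); // highest frequency first
            if (top.size() > 5) {
                top.subList(5, top.size()).clear();       // keep only the top 5
            }
        }
    }
}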
Scaling:
Types → Horizontal and Vertical Scaling.
You can go ahead with a mix of horizontal and vertical scaling.
Sharding:
Sharding is required when scaling horizontally.
Suggestion Table → Range Based sharding
- A, B, C, …, Z → 26 instances created based on the first character of the prefix.
- If any instance is heavily loaded, add a second layer of range-based sharding. For example, if the instance for prefix "A" is heavily loaded, shard it further: AA–AP, AQ–AZ (a routing sketch follows this list).
WordFrequency Table → Date-based sharding (a type of range-based sharding where the range is the date)
- For this design, we don't care about words that were searched a year, a few months, or even a few days back. We care about the words that were searched most recently and most frequently.
- It is easier to maintain the data per date. The frequency over the last 3 to 4 days can be considered for a word, with different weights allocated based on age. A word is pushed to the Suggestion Table once its threshold value is crossed.
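A small sketch of range-based routing for the Suggestion Table, including the second-level split for a hot "A" shard (shard names and ranges are illustrative only):

public class PrefixShardRouter {

    public String shardFor(String prefix) {
        char first = Character.toUpperCase(prefix.charAt(0));
        if (first != 'A') {
            return "shard-" + first;              // first level: one shard per starting letter
        }
        if (prefix.length() == 1) {
            return "shard-AA-AP";                 // bare "A" goes to the first sub-range
        }
        char second = Character.toUpperCase(prefix.charAt(1));
        return (second <= 'P') ? "shard-AA-AP"    // second level: split the hot "A" shard
                               : "shard-AQ-AZ";   // by the second character's range
    }
}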
Replications:
Replication is required as we are aiming for high availability; load should be distributed across multiple instances to keep response latency low and to avoid a single point of failure if a machine goes down.
- Suggestion Table → up to 5 replicas (all servers at the same level). Only used for reads.
- WordFrequency Table → Master–Master → 2 replicas, with asynchronous sync of data between the masters.
Load Balancing:
Helps in better distribution of load across multiple instances.
Required between following layers:
1. Client and Web Servers
2. Web Servers and Application Servers
3. Application Servers and DB Servers
A Weighted Round Robin strategy can be used to distribute the load between different instances. Resource size, response time, and other per-instance factors are used as weights in the weighted round robin algorithm.
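A minimal weighted round-robin sketch; in practice the weights would be derived from instance size or observed response time (all names and values here are illustrative):

import java.util.ArrayList;
import java.util.List;

public class WeightedRoundRobin {

    private final List<String> slots = new ArrayList<>();  // each host repeated by its weight
    private int next = 0;

    public WeightedRoundRobin(List<String> hosts, List<Integer> weights) {
        for (int i = 0; i < hosts.size(); i++) {
            for (int w = 0; w < weights.get(i); w++) {
                slots.add(hosts.get(i));                   // heavier instances get more slots
            }
        }
    }

    public synchronized String nextInstance() {
        String host = slots.get(next);
        next = (next + 1) % slots.size();                  // simple round robin over the weighted slots
        return host;
    }
}

For example, with hosts [A, B] and weights [2, 1], A serves two out of every three requests.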
Auths (Authentications and Authorizations):
Not required for this design at the user level.
Caching:
Cache can be used to store common prefixes along with top 5 suggestions.
The 80–20 rule states that 80% of the requests on the system will query 20% of the data, and the remaining 20% of the requests will query the other 80% of the data.
- Cache those 20% prefixes which will be queried frequently.
- Trie → It’s a prefix tree which optimizes on space to store the words having the same prefix.
At every node, the top 5 suggestions for that prefix are stored.
struct TrieNode {
    int freq;                               // frequency of the word ending at this node
    TrieNode* children[26];                 // one child per letter
    vector<pair<string, int>> suggestions;  // top 5 suggestions for this prefix
};
- Write-around strategy can be used to store the data in the cache.
- Cache Eviction/Invalidation Strategy → LRU (Least Recently Used) can be used to evict data from the cache. The TTL for cache entries can be 12–24 hours. An entry should be invalidated when its prefix is updated in the Suggestion Table by the offline job.
Note: Here, we are implementing our own cache rather than using a standard cache (Memcached, Redis, etc.).
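A Java rendering of this trie-backed cache, mirroring the TrieNode struct above (class and method names are illustrative; it assumes prefixes contain only the letters a–z):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TrieCache {

    private static class Node {
        Node[] children = new Node[26];                 // one child per letter a-z
        List<String> suggestions = new ArrayList<>();   // precomputed top 5 for this prefix
    }

    private final Node root = new Node();

    // Store the precomputed top-5 list for a prefix (filled/refreshed by the offline job).
    public void put(String prefix, List<String> topFive) {
        Node node = root;
        for (char c : prefix.toLowerCase().toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) node.children[i] = new Node();
            node = node.children[i];
        }
        node.suggestions = new ArrayList<>(topFive);
    }

    // Read path: walk the trie character by character and return the stored top 5.
    public List<String> getSuggestions(String prefix) {
        Node node = root;
        for (char c : prefix.toLowerCase().toCharArray()) {
            int i = c - 'a';
            if (i < 0 || i > 25 || node.children[i] == null) {
                return Collections.emptyList();         // cache miss -> fall back to the Suggestion Table
            }
            node = node.children[i];
        }
        return node.suggestions;
    }
}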
Monitoring, Alerting and Backups
Spend the last 2–3 mins on this section if the time permits.
- Monitor the latency of each API.
- Monitor the success and failure rate of each API.
- Monitor the QPS of each API/service to scale up/down as per resource usage and changes in QPS.
- Monitoring on the cache.
- Alerts on failures → if requests fail x times within a time window (1 hour, 10 mins, 20 mins), send an alert.
- Alerting on latency → if x% of the total requests within a window take more than t ms to return results.
- Backups → Required for data recovery, monitoring product-related metrics, and creating dashboards.
Thank You Note:
Thanks for going through the above design. Please drop a comment or clap if the post was helpful. Feedback is welcome.
About Myself:
I am Prakhar Agrawal, graduated from NIT-Bhopal, currently working as a Software Engineer at Google.
I take 1:1 sessions and training on Data Structures, Algorithms and System Design. If you are looking to upgrade your skills in these areas, you can reach out to me at:
Email: prakhar2082@gmail.com
LinkedIn: https://www.linkedin.com/in/prakhar-agrawal-58218279/