20240124-social-network-design
20240124-social-network-design
January 24, 2024
Question
如何設計給Linkedin、Facebook等公司可以處理非常大量資料的data structure?請說明如何設計計算兩個人的最短距離的 (e.g. src->B-C->dst) 演算法
Solutions
Step 1 Simplify the problem - forget about the millions of users
首先使用簡單方法,先從BFS,為何不是DFS,因為DFS會非常沒有效率 另一種方式是bidirectional BFS,也就是從source跟target同時進行BFS,碰撞到即找到最短路徑 使用visited hash table避免走已經走過的路徑
程式
people: A->B->C
def findPathBid(people, src, dst):
while True:
# search from src
collision = searchLevel(people, src, dst)
if collision:
return mergePaths(src, dst, collision)
# search from dst
collision = searchLevel(people, dst, src)
if collision:
return mergePaths(dst, src, collision)
return None
數學證明
假設每個人有k的朋友,路徑q BFS: $k+kk$ , O($k^q$) Bi BFS: $2k$, O($k^{\frac{q}{2}})$ e.g. A->B->C->D->E BFS: $100^4$ Bi BFS: $2100^2$
Step 2 Handle the Million of Users
太多資料可能不能只放在同一台機器,所以使用ID替代
- For each friend ID:
machine_idx = getMachineIDForUser(personID)
- Go to machine
#machine_idx
- On that machine, do:
friend = getPersonWithID(personID)
Optimization: Reduce machine jumps
在不同機器檢索(jump)很消耗資源、可以使用batch jump e.g. A,C,F同時在機器1,進行同時進行A,C,F的query
Optimization: Smart division of people and machines
人通常會和同一地區的人作為朋友,所以使用地區來做機器存取依據可以減少jump次數
Breadth-first search usually requires “making” a node as visited. How do you do that in this case?
使用hash table 檢索node id 查看是否被便利
Other Follow-Up
- In the real world, servers fail. How does this affect to you?
- How could you take advantage of caching?
- Do you search until the end of the graph(infinite)? How do you decide when to give up?
- In real life, some people have more friends of friends than others, and are therefore more likely to make a path between you and someone else. How could you use use the data to pick where to start traversing?