[American President]1 [Bill Clinton]1 has been invited by the [Russian President]2, [Vladimir Putin]2, to visit [Russia]3. [President Clinton]1 said [he]1 looks forward to [his]1 visit.

This paper's main objective is to treat coreference resolution as an online clustering problem, since the mentions arrive in a natural order. This follows the linguistic intuition of how humans resolve coreference while reading a statement.
The authors present their procedure, the Latent Left-Linking Model (L3M). L3M assumes that, since the data arrives in an order, it is sufficient to consider only the previous items when clustering a given item. L3M makes three modeling assumptions:
- Left-Linking. An item i can only link (probabilistically) to items that appear before it.
- Independence of Left-links. The probability of an item i linking to an antecedent item j is independent of whether any other item i' links to j.
- Probabilistic Left-link. For an item set d, the probability of an item i ≥ 1 linking to an antecedent item j (0 ≤ j < i) is $$\Pr[j \leftarrow i; d,{\bf w}] = \frac{\exp\left(\frac{w_{ij}}{\gamma}\right)}{Z_i(d,{\bf w},\gamma)}$$ where $\gamma$ is a temperature parameter and $Z_i$ is the normalization factor. Essentially, $\Pr[j \leftarrow i; d,{\bf w}]$ is $\exp(w_{ij}/\gamma)$ normalized so that the link probabilities for item $i$ sum to one. Here ${\bf w}$ is a weight vector over pairwise features indicative of the similarity of items $i$ and $j$ (such as cosine similarity), and $w_{ij}$ denotes the resulting score for the pair (a small code sketch follows this list).
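To make the link distribution concrete, here is a minimal sketch in Python. It assumes the scores $w_{ij}$ of item $i$ against its candidate antecedents have already been computed; the function and variable names are mine, not the paper's:

```python
import numpy as np

def left_link_probs(scores, gamma=1.0):
    """Temperature-scaled softmax over antecedent scores.

    scores[j] plays the role of w_ij: the compatibility between the
    current item i and an earlier item j (e.g. a weighted combination
    of similarity features such as cosine similarity).
    """
    z = np.asarray(scores, dtype=float) / gamma
    z -= z.max()              # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()        # dividing by Z_i makes the probabilities sum to one
```

As $\gamma$ shrinks toward zero, this distribution concentrates on the single highest-scoring antecedent, so the temperature controls how soft the left-linking decision is.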
The goal of the model is to cluster the items. An item is placed in the cluster to which it has the highest probability of belonging (a greedy approach); if the highest probability goes to a dummy cluster, a new cluster is formed with that item. The probability that an item $i$ joins a previously formed cluster $c$ is the sum of the probabilities of $i$ linking to the items in $c$: $$\Pr[c \odot i; d,{\bf w}] = \sum_{j \in c,\; 0 \le j < i} \Pr[j \leftarrow i; d,{\bf w}]$$ The weight vector ${\bf w}$ is learned by minimizing the regularized negative log-likelihood of the data: casting the problem as a latent tree of left-links, this loss can be minimized with stochastic gradient descent to find ${\bf w}$.
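To make the greedy step concrete, here is a minimal sketch of the online clustering loop. It assumes a `pairwise_score(i, j)` function that returns $w_{ij}$, and that the dummy antecedent gets score 0; these names and conventions are mine, not the paper's:

```python
import numpy as np

def greedy_online_cluster(pairwise_score, n_items, gamma=1.0):
    """Greedy left-linking inference (a sketch, not the authors' code).

    pairwise_score(i, j) returns the compatibility w_ij for an
    antecedent j < i.  Position 0 of `probs` is reserved for the
    dummy antecedent, whose score is assumed to be 0 here.
    """
    clusters = []                              # each cluster is a list of item indices
    for i in range(n_items):
        scores = np.array([0.0] + [pairwise_score(i, j) for j in range(i)])
        probs = np.exp((scores - scores.max()) / gamma)
        probs /= probs.sum()                   # Pr[j <- i]; probs[0] is the dummy link

        # Pr[c (.) i]: sum the link probabilities into each existing cluster c
        cluster_mass = [probs[np.array(c) + 1].sum() for c in clusters]
        if clusters and max(cluster_mass) > probs[0]:
            clusters[int(np.argmax(cluster_mass))].append(i)   # join the best cluster
        else:
            clusters.append([i])               # dummy antecedent wins: start a new cluster
    return clusters
```

Because cluster membership is scored by summing link probabilities over all of a cluster's items, a large cluster can attract an item even when no single pairwise link is the strongest one. For learning, my reading is that the left-link of each item is treated as a latent variable and marginalized over the antecedents that lie in the same gold cluster as the item (these latent links form the tree mentioned above), so the regularized negative log-likelihood looks roughly like $$\frac{\lambda}{2}\|{\bf w}\|^2 \;-\; \sum_{d}\sum_{i}\log\!\!\sum_{\substack{0 \le j < i \\ j\ \text{gold-consistent with}\ i}}\!\! \Pr[j \leftarrow i; d,{\bf w}]$$ which is non-convex in ${\bf w}$ but can still be attacked with stochastic gradient descent.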
I am left with a few questions, namely: how is ${\bf w}$ initialized? At the end of the paper, the authors randomly initialize ${\bf w}$ to show that their algorithm is robust to the non-convexity of the problem, but it wasn't clear to me how ${\bf w}$ was initialized in the earlier experiments.