Debugging VitePress localSearchPlugin Bug
Lee Dogeon @moreal@hackers.pub
The original article is available at https://moreal.hashnode.dev/vitepress-localsearch-debugging. The content is the same, but I'm using Hashnode for image hosting, so I'm leaving this note as a courtesy 🙏
Yesterday, I was planning and trying to implement a bot that would notify about trending articles from a Japanese technical blog service called Zenn. Since the bot was targeting the ActivityPub protocol, I looked into BotKit, and to understand what methods BotKit provides, I ended up searching through the Fedify documentation. While doing so, I noticed that the search function in the Fedify documentation was broken, so I submitted an issue. I was then asked to also report the issue to VitePress. This article is about how I went from investigating the issue to submitting a PR - a classic case of yak shaving 🤣
Reproducing the Bug (Identifying the Cause Variables)
When writing an issue for the VitePress repository, I needed to explain how to reproduce the problem, identify the specific situations where it occurs, and explain why it happens, so I tried various approaches to reproduce the issue.
Initially, as I had reported in the Fedify issue, search results stopped appearing from the point where code blocks existed, so I hypothesized that content containing code blocks wasn't being included in search results.
Since the Fedify documentation uses many plugins, it took time to get feedback. I initialized a project in a temporary directory (/tmp), installed the vitepress@1.6.3 package used by Fedify, and ran yarn vitepress init to generate documents with the default template. Then I set the search.provider value to "local" in .vitepress/config.mts. I added headings from the Fedify documentation, but the search worked fine.
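For reference, the local search setting mentioned above lives under themeConfig in the VitePress config. A minimal sketch, with the other fields generated by the default template omitted:

```typescript
// .vitepress/config.mts
import { defineConfig } from "vitepress";

export default defineConfig({
  themeConfig: {
    // Switch the site search to the built-in local provider.
    search: { provider: "local" },
  },
});
```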
I decided to try replicating more of the environment. I started by applying markdown-related settings, beginning with @shikijs/vitepress-twoslash, but the bug didn't appear. I then added more markdown syntax-related plugins, and the bug was reproduced. Next, I removed the markdown plugins one by one to identify which one was causing the issue, and confirmed that the bug appeared when using the markdown-it-jsr-ref plugin.
The markdown-it-jsr-ref plugin links inline code such as `Type`, which refers to a specific type or function, to its documentation on JSR. In terms of HTML output, text that was previously wrapped only in a <code> tag is now additionally wrapped in an <a> tag. Based on this, I created two simple test cases: ## With <a>a tag</a> heading and ## With <code>code tag</code> heading. I confirmed that for the first case, only "With " appeared in the search results.
After defining a test case that reproduced the bug, it became clear that this was a VitePress bug. I couldn't find anything in the CommonMark spec stating that <a> tags shouldn't be in headings.
Fixing the Bug
Since I wasn't familiar with the VitePress codebase, I first needed to find which code was causing the issue. Because I was using "local" as the value for the search.provider setting, I searched for the keyword "local". This led me to a file called localSearchPlugin.ts. Looking at the content, I saw it was using a search library called MiniSearch, which confirmed I was on the right track.
First, I set a breakpoint on the clearHtmlTags function, which seemed to be removing all HTML tags for the text shown in search results, and traced through the debugger to understand the parent splitPageIntoSections function.
I noticed that the two test cases were processed differently at the line using the headingContentRegex regular expression. As I understand it, this regex takes a heading's HTML as input and tries to capture two parts. The first is the content before the <a> tag (variable name title), which appears in the search results. The second is the content of the href attribute in the <a> tag, which has to start with #.
const headingContentRegex = /(.*?)<a.*? href="#(.*?)".*?>.*?<\/a>/i
The part that extracts something starting with # was meant to extract anchors, and VitePress was automatically adding anchors to the end of headings. Here's an example of the HTML value of a heading from the Fedify documentation:
Implement the <a href="https://jsr.io/@fedify/fedify@1.6.2/doc/federation/~/KvStore"><code>KvStore</code></a> interface <a class="header-anchor" href="#implement-the-kvstore-interface" aria-label="Permalink to &quot;Implement the `KvStore` interface&quot;"></a>
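To see the effect on this real heading, the regex can be run against it directly. The sketch below assumes the inner quotes of the aria-label attribute are escaped as &quot; in the actual HTML, and the variable names are chosen only for illustration:

```typescript
// Regex as quoted in this article (the version before the fix).
const headingContentRegex = /(.*?)<a.*? href="#(.*?)".*?>.*?<\/a>/i;

// Heading HTML from the Fedify docs, split across lines for readability.
// Assumption: the inner quotes of aria-label are written as &quot;.
const fedifyHeading =
  'Implement the <a href="https://jsr.io/@fedify/fedify@1.6.2/doc/federation/~/KvStore">' +
  "<code>KvStore</code></a> interface " +
  '<a class="header-anchor" href="#implement-the-kvstore-interface" ' +
  'aria-label="Permalink to &quot;Implement the `KvStore` interface&quot;"></a>';

const fedifyMatch = headingContentRegex.exec(fedifyHeading);

console.log(fedifyMatch?.[1]); // "Implement the "
console.log(fedifyMatch?.[2]); // "implement-the-kvstore-interface"
```

The captured title stops right before the JSR link, while the anchor is still extracted correctly.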
Applying the regex to our test case returns the following result. The second value in the returned array (the title) stops at "With ", while the third value correctly captures the anchor:
>>> /(.*?)<a.*? href="#(.*?)".*?>.*?<\/a>/i.exec('With <a href="https://example.com">a tag</a> heading <a href="#anchor"></a>')
Array(3) [ 'With <a href="https://example.com">a tag</a> heading <a href="#anchor"></a>', "With ", "anchor" ]
To summarize the situation, the original intent was to capture the part before the <a> tag at the end of the heading and the href value of that <a> tag, but it wasn't working properly.
The problem was with the ? symbol in the first capture group (.*?). According to MDN's documentation on regular expressions, when the ? symbol is used immediately after quantifiers like *, +, ?, or {}, it makes the quantifier non-greedy (matching the minimum number of times), as opposed to the default greedy behavior (matching the maximum number of times).
"If used immediately after any of the quantifiers *, +, ?, or {}, makes the quantifier non-greedy (matching the minimum number of times), as opposed to the default, which is greedy (matching the maximum number of times)."
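As a small illustration of that difference, here is a minimal sketch. The sample string and variable names are made up for this example:

```typescript
// Same input, same pattern shape; only the greediness of the quantifier differs.
const sample = "aXbYb";

// Greedy: `.*` takes as much as possible, so the match runs to the LAST "b".
const greedyMatch = sample.match(/a.*b/);

// Non-greedy: `.*?` takes as little as possible, so the match stops at the FIRST "b".
const lazyMatch = sample.match(/a.*?b/);

console.log(greedyMatch?.[0]); // "aXbYb"
console.log(lazyMatch?.[0]); // "aXb"
```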
Although I don't fully understand regular expressions, based on this quote, I understood that because it was non-greedy, the first capture group didn't take the maximum possible match of 'With <a href="https://example.com">a tag</a> heading ' but only the minimum of "With ". The rest was consumed by the <a.*? part. We can test this by adding parentheses to make it a capture group:
>>> /(.*?)<a(.*?) href="#(.*?)".*?>.*?<\/a>/i.exec('With <a href="https://example.com">a tag</a> heading <a href="#anchor"></a>')
Array(4) [ 'With <a href="https://example.com">a tag</a> heading <a href="#anchor"></a>', "With ", ' href="https://example.com">a tag</a> heading <a', "anchor" ]
So I needed to make the first capture group greedy again by simply removing one ? symbol, and testing showed it worked well. Wow, bug fixed!
>>> /(.*)<a.*? href="#(.*?)".*?>.*?<\/a>/i.exec('With <a href="https://example.com">a tag</a> heading <a href="#anchor"></a>')
Array(3) [ 'With <a href="https://example.com">a tag</a> heading <a href="#anchor"></a>', 'With <a href="https://example.com">a tag</a> heading ', "anchor" ]
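Putting the pieces together, the sketch below runs both versions of the regex over the two test headings and then strips tags from the captured title. Note that stripTags and extractHeading are hypothetical helpers written for this illustration. stripTags only stands in for clearHtmlTags and is not VitePress's actual code.

```typescript
// Regex before the fix (lazy first group) and after it (greedy first group).
const lazyHeadingRegex = /(.*?)<a.*? href="#(.*?)".*?>.*?<\/a>/i;
const greedyHeadingRegex = /(.*)<a.*? href="#(.*?)".*?>.*?<\/a>/i;

// Hypothetical stand-in for clearHtmlTags: removes anything that looks like a tag.
function stripTags(html: string): string {
  return html.replace(/<[^>]*>/g, "");
}

// Hypothetical helper: extracts the searchable title and the anchor of one heading.
function extractHeading(
  regex: RegExp,
  headingHtml: string,
): { title: string; anchor: string } | null {
  const match = regex.exec(headingHtml);
  if (match === null) return null;
  return { title: stripTags(match[1] ?? "").trim(), anchor: match[2] ?? "" };
}

// The two test cases, with the trailing anchor that VitePress appends to headings.
const aTagHeading =
  'With <a href="https://example.com">a tag</a> heading <a href="#anchor"></a>';
const codeTagHeading = 'With <code>code tag</code> heading <a href="#anchor"></a>';

console.log(extractHeading(lazyHeadingRegex, aTagHeading)?.title); // "With"
console.log(extractHeading(greedyHeadingRegex, aTagHeading)?.title); // "With a tag heading"
console.log(extractHeading(lazyHeadingRegex, codeTagHeading)?.title); // "With code tag heading"
console.log(extractHeading(greedyHeadingRegex, codeTagHeading)?.title); // "With code tag heading"
```

Only the heading containing an extra <a> tag behaves differently between the two versions, which is consistent with the <code>-only test case never showing the bug.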
Creating a PR
It's time to submit a PR. When opening a PR, it's important to clearly explain the bug to the maintainers. For this purpose, I created two separate repositories and deployed them to GitHub Pages. One demonstrates the bug in the existing 1.6.3 version, and the other shows how removing the ? symbol fixes the bug, confirming that the change in this PR resolves the issue.
I've attached a screenshot, but since it's difficult to capture everything, please check the full details in the issue 🙏
I spent about an hour preparing the PR, but it was merged almost instantly. I appreciated the quick review, though it also felt a bit awkward 😅
Retrospective
What Went Well
- Although it was yak shaving, I enjoyed the process.
- I used the recently released Zed Debugger for debugging, which was a good experience.
What Didn't Go Well (Areas for Improvement)
- It was difficult to know if there were already existing issues or PRs addressing this bug.
- I spent quite a lot of time preparing the PR description.
- Setting up the reproduction environment took considerable time.
- Writing this retrospective also took about 2 hours.
- I didn't fully understand regular expressions, so I couldn't make a completely accurate assessment of the <a> tag issue.
Points to Improve
- Looking at this written process, it seems quite procedural - could this process be automated with an Agent?
- Could we make it easier to set up and deploy reproduction environments and modified environments?
- Is there an easier way to check if someone is already working on the same issue?
- It would be good to fully understand regular expressions.