mindflakes

Oct 13, 2022 - 1 minute read

GitHub Wiki SEE: Search Engine Enablement: Year One

It’s been nearly a year since I supercharged GitHub Wiki SEE with dynamically generated sitemaps.

Since then a few things have changed:

  • GitHub has started to permit a select set of wikis to be indexed.
  • They have not budged on letting wikis that are not publicly editable be indexed.
  • There is now a star requirement of about 500, and it appears to be steadily decreasing.
  • Many projects have reacted by moving or configuring their wikis to indexable platforms.

As a result, traffic to GitHub Wiki SEE has dropped off dramatically.

stats

This is a good thing, as it means that GitHub is moving in the right direction.

I’m still going to keep the service up, as it’s still useful for wikis that are not yet indexed and there are still about 400,000 wikis that aren’t indexed.

Hopefully GitHub will continue to move in the right direction and allow all wikis to be indexed.

Oct 12, 2022 - 1 minute read

Openpilot Replay Clipper

End to End Longitudinal Control is currently an “extremely alpha” feature in Openpilot that is the gateway to future functionality such as stopping for traffic lights and stop signs.

Problem is, it’s hard to describe its current deficiencies without a video.

So I made a tool to help make it easier to share clips of this functionality, with a view into what Openpilot is seeing and thinking.

https://github.com/nelsonjchen/op-replay-clipper/

It’s a bit heavy in resource use though. I was thinking of making it into a web service but I simply do not have enough time. So I made instructions for others to run it on services like DigitalOcean, where it is cheap.

It is composed of a shell script and a Docker setup that fires up a bunch of processes and then kills them all when done.
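The orchestration idea can be sketched in a few lines of Python. To be clear, the real tool is a shell script plus Docker, and the helper commands here are stand-ins, not the actual clipper processes:

```python
import subprocess
import sys

def run_pipeline(worker_cmds, final_cmd):
    """Start long-running helper processes, run the main job to
    completion, then tear every helper down."""
    workers = [subprocess.Popen(cmd) for cmd in worker_cmds]
    try:
        subprocess.run(final_cmd, check=True)
    finally:
        # Kill all helpers whether the main job succeeded or not.
        for p in workers:
            p.terminate()
        for p in workers:
            p.wait()
    return workers

# Two dummy "streaming" helpers and a short main job stand in for the
# real replay/recording processes.
helpers = [[sys.executable, "-c", "import time; time.sleep(60)"]] * 2
run_pipeline(helpers, [sys.executable, "-c", "print('clip rendered')"])
```

The `try`/`finally` is what makes the "kills them all when done" part reliable even when a step fails.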

Hopefully this leads to more interesting clips being shared, and more feedback on the models that comma.ai can use.

Oct 11, 2022 - 2 minute read

Datasette-lite sqlite-httpvfs experiment POC with the California Unclaimed Property database

screenshot

There’s a web browser version of Datasette called datasette-lite, which runs on Python ported to WASM with Pyodide and can load SQLite databases. As a curious test, I recently grafted the enhanced lazyFile implementation from Emscripten, and then from this implementation, onto datasette-lite. I threw an 18GB CSV of CA’s unclaimed property records from here

https://www.sco.ca.gov/upd_download_property_records.html

into an FTS5 SQLite database, which came out to about 28GB after processing:

POC, non-merging Log/Draft PR for the hack:

https://github.com/simonw/datasette-lite/pull/49

You can run queries through datasette-lite if you URL hack your way to the query dialog. Browsing is kind of a dud at the moment, since Datasette runs a count(*) that downloads everything.

Elon Musk’s CA Unclaimed Property

Still, not bad for a $0.42/mo hostable, cached, CDN’d, read-only database. It’s on Cloudflare R2, so there are no bandwidth costs.

Celebrity gawking is one thing, but the real, useful thing this can do is search by address. If you aren’t sure of the names, such as when people have multiple names or nicknames, you can search by address and get a list of properties at a location. This is one thing the California Unclaimed Property site can’t do.
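The FTS5 setup makes address search a one-liner. A toy version, with made-up column names and rows (the real 28GB database’s schema may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real DB is a ~28GB file on R2
conn.execute(
    "CREATE VIRTUAL TABLE properties USING fts5(owner_name, owner_address, cash_reported)"
)
rows = [
    ("JANE DOE", "123 MAIN ST SACRAMENTO CA", "512.33"),
    ("JOHN ROE", "456 OAK AVE FRESNO CA", "12.00"),
]
conn.executemany("INSERT INTO properties VALUES (?, ?, ?)", rows)

# Full-text search by address terms, the query the CA site can't do.
hits = conn.execute(
    "SELECT owner_name FROM properties WHERE properties MATCH ?", ("main st",)
).fetchall()
print(hits)
```

An FTS5 `MATCH` against the whole table searches every indexed column, which is what makes address fragments work even when you only know part of the street name.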

I am thinking of making this more proper when R2 introduces lifecycle rules to delete old dumps. I could automate the process of dumping with GitHub Actions but I would like R2 to handle cleanup.

Apr 16, 2022 - 1 minute read

Released Gargantuan Takeout Rocket

Finally released GTR or Gargantuan Takeout Rocket.

https://github.com/nelsonjchen/gtr


Gargantuan Takeout Rocket (GTR) is a toolkit of guides and software that helps you take your data out of Google Takeout and put it somewhere else safe: easily, periodically, and fast. The goal is to make it easy to do the right thing and periodically back up your Google account and related services, such as your YouTube account or Google Photos.


It took a lot of time to research and work around the issues I found in Azure.

I also needed to take apart browser extensions and implement my own browser extension to facilitate the transfers.

Cloudflare Workers were also used to work around issues with Azure’s API.

All this combined, I was able to take out 1.25TB of data in 3 minutes.

Now I’m showing it around on Twitter, Discord, Reddit, and more to users who have used VPSes to backup Google Takeout or have expressed dismay at the lack of options for users who are looking to store archives on the cheap. The response from people who’ve opted to stay on Google has been good!

Jul 28, 2021 - 3 minute read

Cell Shield

I built https://cellshield.info to scratch an itch. Why can’t I embed a spreadsheet cell?

You can click on that badge and see the sheet that is the data backing the badge.

It is useful for embedding a cell from an informal database that is a Google Spreadsheet atop some README, wiki, forum post, or web page. Spreadsheets are still how many projects are managed, and this can be used to embed a small preview of important or attention-grabbing data.

It is a simple Go server that uses Google’s API to output JSON that https://shields.io can read. Mix in a simple VueJS front-end that takes public spreadsheet URLs, and tada.
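The payload the server emits is the shields.io "endpoint" badge schema. A minimal sketch in Python (the real server is Go, and cellshield’s exact response may differ, but these are the fields shields.io’s endpoint style consumes):

```python
import json

def badge_json(label, cell_value, color="blue"):
    """Build a shields.io 'endpoint' badge payload for a cell value."""
    return json.dumps({
        "schemaVersion": 1,       # required by shields.io's endpoint style
        "label": label,           # left side of the badge
        "message": str(cell_value),  # right side: the spreadsheet cell
        "color": color,
    })

print(badge_json("prize pool", "$1,264"))
```

Point a shields.io endpoint badge URL at a server returning this JSON and it renders the cell as a badge.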

I picked shields.io as the base for my implementation, as they make really pretty badges and have a lot of nice documentation and options.

I run it on Cloud Run to keep the costs low and availability high. It also utilizes Google’s implementation of authorization to get access to the Sheets API.

The badge generator web UI can also make BBCode for embedding (along with modern Markdown and so on).

I’ve already used it for this and that. The former outputs a running bounty total and the latter is community annotation progress.

A major Game Boy development contest is using it to display the prize pool amount: https://itch.io/jam/gbcompo21

Why I built this

I built it for the Comma10K annotation project. I contributed the original badge, which used a script in the repo to analyze completion progress. Unfortunately, that badge kept diverging from the actual progress because the human element kept changing the structure and organization, so I had the badge removed. The only truth is what is on the spreadsheet.

Completion Status

Popularity

Compared to my other projects, I don’t think this one will be very popular. It might have a chance to spread via word of mouth, but it’s not every day that a large public collaborative project is born with a Google Spreadsheet managing it. The name isn’t that great either, as it probably collides with anti-wireless hysteria. Also, it’s not like there’s an agreed-upon community of Google Spreadsheet users to share this knowledge with.

I also shamelessly plugged it here but I think the question asker might be deceased:

https://stackoverflow.com/questions/57962813/how-to-embed-a-single-google-sheets-cells-text-content-into-a-web-page/68505801#68505801

At least it was viewed 2,000 times over the last two years.

Maintenance

It’s on Cloud Run. The cost to keep it running is low as it sometimes isn’t even running and the Sheets V4 API it uses was recently declared “enterprise-stable”. Because of this, it will be super easy to maintain and cheap to run. I don’t expect any more than $2 a month.

Jul 27, 2021 - 9 minute read

GitHub Wiki SEE: Search Engine Enablement

I built this in March, but I figure I’ll write about this project better late than never. I supercharged it in June out of curiosity to see how it would perform. This is just a ramble; I am not a good writer, but I wanted to write things down.

TLDR: I mirrored all of GitHub’s wikis with my own service to get the content onto Google and other search engines.

If you’ve used a search engine to look up something that could appear on a GitHub wiki, you may well have used my service.

https://github-wiki-see.page/

Did you know that GitHub wikis aren’t search engine optimized? In fact, GitHub excludes their own wiki pages from search results with their robots.txt. robots.txt is the mechanism sites use to tell search engines whether or not to index certain sections of a site. If you search on Google or Bing, you won’t find any wiki results unless your search query terms are directly inside the URL.
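The effect of such a rule is easy to demonstrate. The pattern below is paraphrased from what github.com/robots.txt did at the time (the real file has many more rules), and fnmatch’s `*` is greedier than true robots.txt matching, but it works for this illustration:

```python
import fnmatch

# Paraphrased wiki-blocking rule: "Disallow: /*/*/wiki*"
DISALLOWED = ["/*/*/wiki*"]

def blocked(path):
    """Return True if a crawler honoring the rule would skip this path."""
    return any(fnmatch.fnmatch(path, pat) for pat in DISALLOWED)

print(blocked("/commaai/openpilot/wiki/Nissan"))  # wiki page: blocked
print(blocked("/commaai/openpilot"))              # repo page: crawlable
```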

The situation with search engines is currently this: say you wanted information relating to Nissan Leaf support in the openpilot self-driving car project’s wiki, so you search for “nissan leaf openpilot wiki”. Unfortunately, the search results would be empty. If you searched for “nissan openpilot wiki”, https://github.com/commaai/openpilot/wiki/Nissan would show up, but only because all the terms are in the URL. The content of the GitHub wiki page is not used for returning good results.

It has been like this for at least 9 years. GitHub has responded that they decline to remove the rule from robots.txt because they believe wikis can be used for SEO abuse. I find this quite unbelievable. The standard mechanism to prevent that is to put rel="nofollow" on all user-generated <a> links in the content, with newer guidance suggesting rel="nofollow ugc". By removing the content from the index, they have blinded a large portion of the web’s users to the content in GitHub wiki pages.
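The rel="nofollow ugc" mitigation amounts to a link rewrite pass over rendered wiki HTML. A naive regex sketch; a real implementation would use an HTML parser and merge any existing rel attributes:

```python
import re

def add_ugc_rel(html):
    """Tag every <a> with rel="nofollow ugc" so search engines don't
    credit user-generated links (the standard anti-SEO-abuse measure)."""
    return re.sub(r"<a\s", '<a rel="nofollow ugc" ', html)

print(add_ugc_rel('<a href="https://spam.example">link</a>'))
```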

Until I made a pull request to add a notice to their documentation that wikis are not search engine optimized, there was absolutely no official documentation or notice about this decision. I still believe it is not enough as nothing in the main GitHub interface notes this limitation.

I believe there are many dedicated users and communities that are not aware of this issue and are diligently contributing to GitHub wikis. This really frustrated me; I was one of them. Many others and I contributed a lot to the comma.ai openpilot wiki, and it was maddening that our work was not indexed.

I noticed that for a lot of technical questions, mirrors of Stack Overflow ranked highly, and yet they weren’t Stack Overflow.

So what I did was this: I made a proxy site that mirrors GitHub wiki content without a restrictive robots.txt. I put it up at https://github-wiki-see.rs. I called it GitHub Wiki SEE, for GitHub Wiki Search Engine Enablement. It’s kind of a twisted wording of SEO, as I see it as enabling, not even “optimizing”.

What’s the technology behind it?

The current iteration is a Rust Actix Web application. It parses the URL and makes requests out to GitHub for the content. It then scrapes the relevant HTML and reformats it to be more easily crawlable. Finally, it serves the content to search engines and to users coming in from one, with a link to the original page and the project’s reasoning pinned at the bottom as a static element. Users aren’t meant to read the proxied page, only to click through to the original GitHub page.

With these choices, I was able to meet my requirements:

  • Hosted on Google Cloud Run, which scales to zero in case it doesn’t become popular.
  • Easy to deploy.
  • Cheap.
  • Small enough to run on Cloud Run.
  • Low maintenance.

Another consideration was that it needed to not get rate limited by GitHub. To handle this, I made the server return an error and exit on Cloud Run if it detected a rate-limiting response, cycling to a new server with a new client IP address. Unfortunately, this limited each server to handling one request at a time.
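The exit-to-cycle trick looks roughly like this. The real service is Rust/Actix; this is a Python sketch of the idea, and the 429 status is one example of what a rate-limit response might look like:

```python
import sys
import urllib.request
from urllib.error import HTTPError

def fetch_upstream(url):
    """Proxy fetch: on a rate-limit response, exit the whole process so
    Cloud Run replaces the instance (and with it, the egress IP)."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    except HTTPError as e:
        if e.code == 429:   # upstream says we're rate limited
            sys.exit(1)     # die; the next request gets a fresh instance
        raise
```

Dying on purpose is crude, but on a scale-to-zero platform a replacement instance is cheap and arrives with a different client IP.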

Supercharging it

I originally built the project in March. I monitored the Google Search Console like a hawk to see how the openpilot content I had contributed to was responding to the proxy setup. Admittedly, it got a few users to click through here and there, and I was pretty happy. For a month or two, I was pretty satisfied. However, I kept thinking about the rest of the users who lamented this situation. While some did take up my offer to post a link somewhere on their blog, README, or in a tweet, adoption wasn’t as much as I had hoped.

I had learned about GH Archive, a project that has archived every GitHub event since about 2012, and that it mirrors its archive on BigQuery, which lets you search large datasets with SQL. Using BigQuery, I was able to determine that about 2,500 GitHub wiki pages get edited every day. In a week, about 17,000. A month? 66,000 pages. My fury at the blocking of indexing increased.
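The kind of query involved looks roughly like this, against GH Archive’s public BigQuery mirror (run it in the BigQuery console or the bq CLI; treat the exact table layout as an assumption, not gospel):

```python
# Count wiki ("Gollum") edit events per day across GH Archive's
# public daily tables for 2021.
QUERY = """
SELECT DATE(created_at) AS day, COUNT(*) AS wiki_edits
FROM `githubarchive.day.2021*`
WHERE type = 'GollumEvent'
GROUP BY day
ORDER BY day
"""
```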

I decided to throw $45 at the issue and run one big mega-query over all of GH Archive. Over its lifetime, 4 million wiki pages had been edited. I dumped this list into a queue and processed it with Google Cloud Functions, running a HEAD request against every page and re-exporting the results to BigQuery. 2 million pages were still there.

You can find the results of this in these public tables on BigQuery:

  • github-wiki-see:show.checked_touched_wiki_pages_upto_202106
    • Checked with HEAD
  • github-wiki-see:show.touched_wiki_pages_upto_202106

With this checked_touched_wiki_pages_upto_202106 table, I generated 50MB of compressed XML sitemaps for the GitHub Wiki SEE pages. I called them seed sitemaps. The source for this can be found in the sitemap generator repo.
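Sitemap generation itself is mechanical. A minimal sketch of one output file (the real generator shards millions of URLs across many files; the sitemaps.org protocol caps each file at 50,000 URLs):

```python
import gzip
from xml.sax.saxutils import escape

def write_sitemap(urls, path):
    """Write one gzip-compressed XML sitemap file."""
    assert len(urls) <= 50_000  # per-file limit in the sitemap protocol
    entries = "".join(f"<url><loc>{escape(u)}</loc></url>" for u in urls)
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        + entries + "</urlset>"
    )
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(xml)

write_sitemap(
    ["https://github-wiki-see.page/m/commaai/openpilot/wiki/Nissan"],
    "sitemap-00001.xml.gz",
)
```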

BigQuery is expensive, so I made another sitemap series for continuous updates. An earlier version used BigQuery, but the new version runs a daily GitHub Actions job that downloads the last week of events from GH Archive, processes them for unique Wiki URLs and exports them to a sitemap. This handles any ongoing GitHub Wiki changes. The source for this can also be found in the sitemap generator repo.
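The daily job’s core is just parsing GH Archive’s newline-delimited JSON for wiki ("Gollum") events. A sketch with a synthetic event in GH Archive’s shape (one real file exists per hour, e.g. https://data.gharchive.org/2021-06-01-0.json.gz):

```python
import json

def wiki_urls(event_lines):
    """Extract unique wiki-page URLs from raw GH Archive event lines."""
    seen = set()
    for line in event_lines:
        event = json.loads(line)
        if event.get("type") != "GollumEvent":  # wiki edits only
            continue
        for page in event["payload"]["pages"]:
            seen.add(page["html_url"])
    return sorted(seen)

# A single synthetic event for illustration:
sample = [json.dumps({
    "type": "GollumEvent",
    "payload": {"pages": [
        {"html_url": "https://github.com/commaai/openpilot/wiki/Nissan"}
    ]},
})]
print(wiki_urls(sample))
```

Feed the resulting URL list into the sitemap writer and the continuous pipeline is complete.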

The results

I am getting thousands of clicks and millions of impressions every day. GitHub Wiki SEE results are never at the top, averaging a position of about 29, or the third page. The mirrored pages may not rank high, and I don’t care: the original content was never going to rank at all, so the content appearing in the search results of thousands of users, even if somewhat deep, satisfies me.

All in all, a pretty fun project. I am really happy with the results.

  • My contributions to the OpenPilot wiki are now indexed.
  • Everyone else’s contributions to their wikis are now indexed as well.
  • I learned a thing or two about writing a Rust web app.
  • I learned a lot about SEO.
  • I feel I’ve helped a lot of people who need to find a diamond in the rough when the diamond isn’t in the URL.
  • I feel a little warm knowing I have a high-traffic site.

I have no intention of putting ads on the site. It costs me about $25 a month to run but for the high traffic it gets, the peace of mind I’ve acquired, and the good feelings of helping everybody searching Google, it is worth it.

I did add decommissioning steps. If it comes to it, I’ll make the site directly redirect, but only after GitHub takes wikis off their robots.txt.

Things I Learned

This is a grab-bag of stuff I learned along the way.

  • Sitemaps are a great way to get search engine traffic to your site.
  • Large, multi-million-page sitemaps can take many months to process. Consider yourself lucky if Google parses 50,000 URLs a day.
    • The Google Search Console saying “failed to retrieve” is a lie. It means “processing”, but this is apparently a long-standing bug.
  • GoogleBot is much more aggressive than Bing.
    • Google’s index of my proxy site is probably larger than Bing’s by an order of magnitude or two. For every 30, 40, or even 100 GoogleBot requests, I get one Bing request.
    • This matters for serving DuckDuckGo users, as Bing’s index is what DuckDuckGo uses.
  • The Rust Tokio situation is a bit of a pain, as I am stuck on the version of Tokio that stable Actix Web needs. As a result, I cannot use the modern Tokio features that newer Rust crates use. I may port it to Rocket or something in the future.
  • A lot of people search for certification answers and test answers on GitHub Wikis.
  • GitHub Wikis and GitHub are no exception to Rule 34.
  • Cloud Run is great for getting many IP addresses to use and to swap out if any die. The outgoing IP address is extremely ephemeral, and changing it is a simple matter of exiting if a rate limit is detected.
    • This might not always work. Just exit again if so.
  • GCP’s UI is nicer for manual use compared to AWS.
  • The built-in logging capabilities and sink of GCP products are also really nice. I can easily re-export and/or stream logs to BigQuery or to a Pub Sub topic for later analysis. I can also just use the viewer on the UI to see what is going on.
  • I get fairly free monitoring and graphing from the GCP ecosystem. The default graphs are very useful.
  • Rust web apps are really expensive to compile on Google Cloud Build. I had to use 32-core machines to get it to compile in underneath the default timeout of 10 minutes. I eventually sped it up and made it cheaper to build by building on a pre-built image along with a smaller instance but it is still quite expensive.
  • The traffic from the aggressive GoogleBot is impressively large. Thankfully, I do not have to pay the bandwidth costs.
  • My sloppy use of the HTML parser may be causing errors, but RAM is fairly generous. It’s Rust, too, so I don’t really have to worry much about garbage collection or anything similar.
  • Google actually follows links it sees in the pages. When I started, it did not. Maybe it only does this on more popular pages.
  • The Rust code is really sloppy, but it still works.

Sep 12, 2020 - 6 minute read

The short lived life of Breakout Bots for Zoom Meetings

With the pandemic happening, it’s been tough for many organizations to adapt. We’re all supposed to be together! One way organizations have been staying together is by using Zoom Meetings.

At work, we’ve been using a knock-off reseller version called RingCentral Meetings. From looking at their competitors, Zoom has pulled out all the stops to make their meeting experience the most efficient, resilient, cheap, and reliable.

One of these features is “Breakout Rooms”. With breakout rooms, users can subdivide a bigger meeting into mini-meetings. For most of the year, though, there has been an odd restriction on breakout rooms.

You cannot go to another breakout room as an attendee without the host reassigning you unless you are a Host or a Co-Host.

Obviously, this can put a lot of load on the poor user who is designated the host. Even the Co-Hosts can’t assign users to another breakout room.

So I made a bot:

https://github.com/nelsonjchen/BreakoutRoomsBotForZoomMeetings

the bot rename in action

It’s quite a hack, but it basically controls a web client that is a Host and assigns users based on chat commands and on attendees renaming themselves.

Last night, I finally finished optimizing the bot. It should be able to handle hundreds of users and piles of chat commands.

This morning I learned that Zoom will add native support for switching:

https://www.reddit.com/r/Zoom/comments/irkm82/selfselect_breakout_room/

It was fun, but as brief as the bot’s existence was, its main purpose’s days are numbered. Maybe someone can reuse the code to make other Zoom Meetings add-ons or bots?

Technical and Learnings

  • I knew about this issue for many months, but I never made a bot because there wasn’t an API. I waited until I was very dissatisfied with the situation and then made the bot. I should have just done it, without the API.

  • Thankfully, by doing this quickly, I only really spent like a week of time on this total. With some of the tooling I used, I didn’t have to spend too long debugging the issues.

  • The original implementation of this tool was a copy-and-paste into Chrome’s Console. This is a much worse experience for users than a Chrome extension. Kyle Husmann contributed a fork that turned the bot into an extension, and it is totally something I’ll be stealing for future hacks like this.

  • Zoom got to where they are by doing things their competitors did not, and rather quickly. This feature/feedback responsiveness is great and is why they’ve been crushing it. I feel this is a feature that would have taken their competitors possibly a year to implement. At this time, Microsoft Teams is also looking to implement breakout rooms.

  • Zoom’s web client uses React+Redux, with the Redux-Thunk middleware. They do stick some side effects in places where they shouldn’t, but whatever, it works. It did make integration at this layer very hard, if not impossible, for my bot. However, subscribing to state changes was, and is, very reliable.

  • RxJS is a nice implementation of the Reactive Pattern with Observables and stuff like with RX.NET. Without this, I could not fit the bot in so few lines of code with the reliability and usability that it currently has.

  • With RxJS and Redux’s store, I was able to subscribe to internal Redux state changes and transform them into streams of events. For example, the bot transforms the streams of renames and chat message requests into a common structure, which it then merges for processing by later operators. Lots of code is shared, and I’m able to rearrange and transform at will.

  • The Local Overrides functionality is great if you want to inject Redux DevTools into an existing minified application. If you beautify the code, you can inject dev tools too. Here’s the line of code I replaced to inject Redux DevTools into Zoom’s web client.

    Previous:

    , s = Object(r.compose)(Object(r.applyMiddleware)(i.a));
    

    After:

    , s = window.__REDUX_DEVTOOLS_EXTENSION_COMPOSE__(Object(r.applyMiddleware)(i.a));
    

    This was super helpful in seeing how the Redux state changes when performing actions in a nice GUI.

  • fuzzysort is great! I’m quite surprised at how unsurprising the results can be when looking up rooms by a partial or cut up name.

  • The underlying WebSocket connection is available globally as commandSocket. If you observe the commands, they are just JSON, and you can inject your own commands to control the meeting programmatically.

  • Since Breakout Room status is sent back via WebSocket, the UI will automatically update. I did not have to go through the Redux store to update the Breakout Room UI.

  • UI automation of the Breakout Room UI: ~100ms vs. a WebSocket command: 2ms. Wow!

  • Chat messages are encrypted and decrypted client-side before going out on the Websocket interface. This probably doesn’t stop Zoom from reading the chat messages since they are the ones who gave you the keys. I could have reimplemented the encryption in the bot but after running some replay attacks, the UI did not update. Without the UI updating, I didn’t feel it was safe for the host to not know what the bot sent out on their behalf.

  • Even if I had implemented the chat message encryption, the existing UI automation of chat messages hovered around 30ms. The profiler also showed that the overhead of automation wasn’t that much and much of the chat message encryption contributed to the 30ms. It wouldn’t have been much of a performance win if I implemented it myself.

  • Possible security-ish issue: since it isn’t an acceptable bounty issue, and hosts can simply disable chat, I will disclose that it is possible to DoS the native client by simply spamming chat. If there is too much text, the native clients will lock up and require a restart; native clients don’t drop old chat logs. Web clients don’t have this issue, as they drop old chat automatically. Last but not least, it is possible for the host to disable chat in many ways to alleviate such an attack. I wasn’t sure whether to let my load_test bot out there because of this, but after reflection, I think it’s OK.

  • RxJS is the MVP. Or rather, the Reactive stuff is the MVP. If you can get it into a stream, it’s super powerful. Best yet, you can reuse your experience from other ecosystems. I was previously using Rx.NET for some stuff at work. I’ll definitely be looking at using the Python version of this in the future.

    • Though it did seem the differing Reactive stuff have different operators. The baselines are still great though and you can always port operators from one to another implementation.
  • Rx really benefits if you can use TypeScript. Unfortunately, I never got to this stage.
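The rename/chat merging described above can be sketched in plain Python (the real bot uses RxJS merge()/map(); the command syntax and event shapes here are illustrative, and a real merge interleaves by arrival time rather than chaining):

```python
from dataclasses import dataclass

@dataclass
class MoveRequest:
    user_id: str
    room: str

def from_rename(evt):
    # "Alice (Room 2)" -> Alice wants Room 2
    _, _, room = evt["new_name"].rpartition("(")
    return MoveRequest(evt["user_id"], room.rstrip(")").strip())

def from_chat(evt):
    # "/join Room 2" -> sender wants Room 2
    return MoveRequest(evt["sender_id"], evt["text"].removeprefix("/join").strip())

def move_requests(renames, chats):
    """Merge both event streams into one stream of MoveRequests that
    downstream operators (fuzzy room lookup, assignment) can share."""
    for evt in renames:
        yield from_rename(evt)
    for evt in chats:
        yield from_chat(evt)

reqs = list(move_requests(
    [{"user_id": "u1", "new_name": "Alice (Room 2)"}],
    [{"sender_id": "u2", "text": "/join Room 3"}],
))
print(reqs)
```

Normalizing both sources into one type is what lets the later operators stay source-agnostic.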

That said, this was fun! Now onward to the next thing.

Oct 2, 2019 - 1 minute read

Stop Sims 2 Purple Soup or Pink Flashing Crashes on Windows 10 and Modern Hardware with D9VK

Wrote a guide on how to fix the Purple Soup or Pink Flashing crashes for The Sims 2 on Windows 10 1903+ on powerful gaming computers. It’s great to make this game accessible again.

Guide Link: https://docs.google.com/document/d/19JMr-FQSU3AlF7Kyvcrr7awTiHBDQSUNLG6qfaI6rOs/edit#

General idea: Replace Microsoft DirectX 9 with D9VK.

D9VK is actively developed and fixes bugs. It also translates the game’s graphics calls to a stack that is more open and stable.

So this:

sims 2 purple soup

Turns to this:

Fixed and proof of running

Jun 2, 2019 - 2 minute read

Fast and lightweight headless Qt Installer from Qt Mirrors: aqtinstall

As part of my work on the Barrier project, we needed a way to build it with the Qt installer from an online CI/CD service. On some platforms, such as AppVeyor, Qt may be preinstalled. But on other platforms, such as Azure Pipelines, it may not be. Nor on Travis. Nor on CircleCI. For Linux and macOS, there is a saving grace: the Qt libraries can be installed headlessly from operating system repositories such as apt or brew. But this luxury does not exist on Windows. While Chocolatey does offer some Qt packages, they tend to be ancient or non-official.

There is a toolkit called CuteCI that lets you use the Qt Installer and script its GUI headlessly. It’s janky and big, though. Additionally, to pin versions, you must download a 3 to 5 GB Qt Installer to install a pinned version of Qt.

So, I found this: https://lnj.gitlab.io/post/qli-installer/

Apparently it’s possible to mimic the Qt Installer from a Python script.

To save it from possible bit-rot, I forked it to GitHub. I added parallel downloading, Windows support, and a CI system. I pushed to Azure Pipelines, and each commit could cause 100 jobs to run across all the various platforms. Azure Pipelines did not break a sweat.

CuteCI is still extremely useful if you must use the official installer with official support. But the speeds with my fork were insane. It would normally take many minutes to install Qt with CuteCI. My qli-installer fork? 10 or 20 seconds.

But it turns out there’s an even better installer than mine!

https://github.com/miurahr/aqtinstall/

aqtinstall

  • Classy reusable Python structure
  • On PyPI. You can pip install aqtinstall and have it ready for immediate use.

This is also another fork of qli-installer. Its CI system was in a bit of a sorry state, though. So I contributed matrix-generated jobs to test aqtinstall across Mac, Windows, and Linux. @miurahr was able to dial down the installation and testing madness and ramp it up in some places. Actual Qt applications are built to test whether aqtinstall works. Additionally, there was the insight that mobile platforms generally always require the latest Qt version.

So, if you want a fast and lightweight headless Qt installer that installs from Qt Mirrors and is continually tested, use aqtinstall!
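A typical headless install from CI can be driven from Python itself. The argument order below matches the aqtinstall CLI of this era; check `aqt --help` for the version you actually install:

```python
import sys

def aqt_command(version, host, target, arch):
    """Build a `python -m aqt` invocation for a headless Qt install."""
    return [sys.executable, "-m", "aqt", "install", version, host, target, arch]

cmd = aqt_command("5.12.0", "windows", "desktop", "win64_msvc2017_64")
# subprocess.run(cmd, check=True)  # typically finishes in tens of seconds
print(cmd)
```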