Tag: cloudfront

  • Making a Simple WoW Token Tracker Faster

    Since this is around the second anniversary for wowtoken.app, I thought I should do an update detailing a lot of the work that’s been done between when I wrote the first article and now.

    Beyond the usual dependency updates and small UI improvements, most of the big changes have happened under the hood.

    Introducing Lambda@Edge

    For a long time, I have wanted to fully utilize the power of Functions as a Service, but I originally pigeonholed myself into using us-east-1. The original data providers that generate the responses for data.wowtoken.app (the API) were written in my favorite language, Ruby.

    This was mainly because when I was first learning how to use Lambda, I didn’t want to have to think about language-specific abstractions when Ruby’s come most naturally to me.

    The issue is that Ruby is kind of a second-class citizen on Lambda. AWS does support it as a runtime, but you are limited to 2.7, it is only available in a handful of regions (at least, last I checked), and, most crucially, it is not available for use with Lambda@Edge.

    However, Python is a first-class citizen (along with Node.js), so after a quick rewrite of the function into Python (and some much-needed refactoring), we were good to go.

    Now before we go any further, I should give a little background to Lambda@Edge if you are completely unfamiliar. Where Lambda is a scalable function in a single region, Lambda@Edge takes that function and runs it much closer to where your users are, at the CloudFront Points of Presence closest to them.

    There are a few advantages to this, the main one being speed in the form of much reduced latency.

    While the speed of light in fiber is fast, it takes a noticeable amount of time for a packet of information to make a trip there and back. Even across the US it can be impactful for some applications (think online games and VoIP), and that’s the best-case scenario. Last-mile links such as DSL and cellular add even more overhead. All of that gets multiplied when TCP does its handshake, and again for SSL. Very quickly, what is only a 150ms ping time can become seconds of waiting for a page to load.
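
    To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a brand-new connection costs one round trip for the TCP handshake, two for a TLS 1.2 handshake (one for TLS 1.3), and one more for the HTTP request itself. The function name and the example ping times are made up for illustration.

```python
def time_to_first_byte_ms(rtt_ms, tls_round_trips=2):
    """Estimate time to first byte on a brand-new HTTPS connection.

    Counts 1 round trip for the TCP handshake, `tls_round_trips` for the
    TLS handshake (2 for TLS 1.2, 1 for TLS 1.3), and 1 for the HTTP
    request and response. DNS lookups and server time are ignored.
    """
    tcp_round_trips = 1
    http_round_trips = 1
    return rtt_ms * (tcp_round_trips + tls_round_trips + http_round_trips)


if __name__ == "__main__":
    print(time_to_first_byte_ms(150))     # TLS 1.2: 4 round trips, 600 ms
    print(time_to_first_byte_ms(150, 1))  # TLS 1.3: 3 round trips, 450 ms
    print(time_to_first_byte_ms(20))      # a nearby edge location: 80 ms
```

    That is the cost of one request. A page that opens several fresh connections, or follows a redirect, pays it more than once.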

    The solution to this (most of the time, I am not talking about the edge case of you wanting to talk to your friend directly on the opposing side of the globe), is to move your compute closer to your users.

    Lambda@Edge makes this easy, and with one click of a button you can deploy your function worldwide.

    But moving the compute closer to the end user was only half the battle. As it stood, all the data was still getting sourced from DynamoDB (and Timestream, but we will come back to it later).

    DynamoDB Global Tables

    DynamoDB had long been my backing store for both the current token price and more recent historical data, as its performance was more than adequate for this application. One of the neat features it has is the easy ability to set up what is essentially eventually consistent Active-Active replication.

    With one click of a button in the console, you can replicate a DynamoDB table to any region you have access to. This makes the other part of the puzzle, data locality, very easy for the >90% of requests going to it.

    With the two puzzle pieces now in our hands, we needed to combine them to make the magic happen.

    The solution I came up with for this is a simple hash map in which I hand-mapped regions to the closest DynamoDB Global Table replica, with a failover to the most likely-to-be-closest table if the region the function is running in isn’t in the map. It’s not elegant and I would like to do another pass at it, but it works.
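
    A lookup like that might look something like the minimal sketch below. The map entries, the prefix-based fallback, and every name in it are illustrative assumptions, not the app’s actual code. It also assumes the function reads its current region from the AWS_REGION environment variable that the Lambda runtime sets.

```python
import os

# Hypothetical hand-written map: region the function is running in ->
# nearest region that holds a DynamoDB Global Table replica.
NEAREST_REPLICA = {
    "us-east-1": "us-east-1",
    "us-east-2": "us-east-1",
    "us-west-2": "us-west-2",
    "eu-west-2": "eu-west-3",            # London -> Paris
    "eu-west-3": "eu-west-3",            # Paris
    "eu-central-1": "eu-west-3",         # Frankfurt -> Paris
    "eu-north-1": "eu-north-1",          # Stockholm
    "ap-northeast-1": "ap-northeast-1",  # Tokyo
    "ap-southeast-2": "ap-southeast-2",  # Sydney
    "sa-east-1": "sa-east-1",            # Sao Paulo
}

# Assumed failover: guess by the geographic prefix of the region name.
FALLBACK_BY_PREFIX = {
    "us": "us-east-1",
    "eu": "eu-west-3",
    "ap": "ap-northeast-1",
    "sa": "sa-east-1",
}
DEFAULT_REPLICA = "us-east-1"


def pick_replica_region(current_region):
    """Return the replica region to query from `current_region`."""
    if current_region in NEAREST_REPLICA:
        return NEAREST_REPLICA[current_region]
    prefix = current_region.split("-", 1)[0]
    return FALLBACK_BY_PREFIX.get(prefix, DEFAULT_REPLICA)


if __name__ == "__main__":
    here = os.environ.get("AWS_REGION", DEFAULT_REPLICA)
    print(here, "->", pick_replica_region(here))
```

    The returned region would then be passed as the region of whatever DynamoDB client the function creates.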

    There is no real reason to run Dynamo Global Tables in more than a handful of regions unless you have specific constraints to take into consideration (the main one I can think of is wanting to reduce inter-region data transfer costs if you are moving a lot of data – this app is not).

    Currently I am utilizing several replicas in the US, as well as Paris, Sydney, Tokyo, Stockholm, and Sao Paulo. I feel like this covers most scenarios where traffic for this app would be coming from, and means the latency from Lambda → DynamoDB should never be more than 50ms for most regions, and should be much lower than that for a vast majority of my users.

    Global Timestream

    While I wish making Timestream a global service was as easy as a click of a button, it still has a long way to go before it matures. This means replication has to be handled by your application in one way or another. Otherwise, it does what it is meant to do pretty well, with the other caveat being the long response time to a query, even for “in memory” data. So for the historical data, I took a multi-pronged approach.

    As seen from its great job responding to the current price, DynamoDB has an excellent response time – at least for the use case of a web browser requesting data – and was identified as a good candidate to be used in addition to Timestream for a hybrid approach.

    Given most people are looking at the last few days, up to a month or so, a new “historical” DynamoDB table was created to hold the previous few weeks of data. This ended up working so well that the expiry column has been updated from a few weeks, to a month, to three, and now to a full year of data (though this is still being filled).
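
    For anyone unfamiliar with how an expiry column works in DynamoDB: with TTL enabled on a table, each item carries a numeric attribute holding a Unix epoch timestamp in seconds, and DynamoDB deletes the item some time after that moment passes. A minimal sketch of building such an item follows. The attribute names and the helper are hypothetical, not the table’s real schema.

```python
import time

SECONDS_PER_DAY = 86_400


def build_history_item(region, price, retention_days=365, now=None):
    """Build a price-history item with a TTL attribute.

    `expires_at` is a Unix epoch timestamp in seconds, which is the
    format DynamoDB's TTL feature expects. All attribute names here
    are illustrative assumptions.
    """
    now = int(time.time()) if now is None else int(now)
    return {
        "region": region,
        "timestamp": now,
        "price": price,
        "expires_at": now + retention_days * SECONDS_PER_DAY,
    }


if __name__ == "__main__":
    print(build_history_item("us", 250_000))
```

    Lengthening the retention only changes the expiry of items written from then on, so the table grows into its new window as fresh data arrives.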

    This means most requests will benefit from the data locality of many DynamoDB Global Table replicas, but that still leaves the older historical data, whose response size would not fit into a single DynamoDB response page, nor is stored in DynamoDB.
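
    The split between the two stores can be sketched as a simple check on how far back a request reaches. This assumes the one-year window mentioned above; the function and constant names are invented for the example.

```python
from datetime import timedelta

# Assumed retention window of the "historical" DynamoDB table.
DYNAMO_RETENTION = timedelta(days=365)


def pick_backend(requested_range):
    """Choose a data store for a history request.

    Ranges that fit inside the DynamoDB retention window are served
    from the nearest replica; anything older goes to Timestream.
    """
    if requested_range <= DYNAMO_RETENTION:
        return "dynamodb"
    return "timestream"


if __name__ == "__main__":
    print(pick_backend(timedelta(days=30)))
    print(pick_backend(timedelta(days=2 * 365)))
```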

    For a long while, it had been served from a single location, a Timestream database in us-east-1. However, I was running into exceptionally poor response time for large queries that happened from locations that were not in the Americas. Multiple seconds of waiting, which makes for a very poor experience. While this was somewhat alleviated by upping the Max-Age in the cache, it was a band-aid at best, as cache misses still have to wait.

    The proper solution to this is to have multiple Timestream replicas in multiple regions. Like I alluded to in the first paragraph of this section, Timestream has no support for replication built in, so it had to be implemented on the application layer.

    Unfortunately, Timestream is (still) only available in a handful of regions, though thankfully those are fairly geographically diverse. Bringing the new regions up to speed with the original required backfilling data, which until pretty recently was not trivial on Timestream (it used to only let you backfill as far back as your “in memory” retention window reached; this restriction has since been removed).

    This involved retrieving all the data from the current table (which required some clever Timestream queries to get it back in its original form, otherwise responses were too big for Timestream), turning on the EventBridge trigger for the new regions so they can start storing the latest data, and writing a simple script to slowly backfill the rest of the data.
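
    The “slowly backfill” part could be sketched roughly like this. Timestream’s WriteRecords call accepts at most 100 records per request, so the records are chunked and written with a pause between batches. The writer is passed in as a plain callable so the sketch runs without AWS credentials; the function names and the pause length are assumptions, not the original script.

```python
import time

MAX_RECORDS_PER_WRITE = 100  # Timestream WriteRecords limit per request


def batched(records, size=MAX_RECORDS_PER_WRITE):
    """Yield lists of at most `size` records from any iterable."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, partially filled batch
        yield batch


def backfill(records, write_batch, pause_seconds=0.5):
    """Write `records` in batches, sleeping between writes.

    `write_batch` is any callable that takes a list of records.
    Returns the number of records written.
    """
    written = 0
    for batch in batched(records):
        write_batch(batch)
        written += len(batch)
        time.sleep(pause_seconds)
    return written


if __name__ == "__main__":
    fake_rows = ({"n": i} for i in range(250))
    print(backfill(fake_rows, lambda b: print("wrote", len(b)), pause_seconds=0))
```

    With boto3, `write_batch` would typically wrap the `write_records` method of a `timestream-write` client pointed at the new region.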

    Once the other regions were caught up, we could now enable traffic to them, implementing a similar solution to the DynamoDB smart-routing.

    Top Speed?

    We are reaching the point of chasing tails, and by that I mean tail latencies. CloudFront caches the static assets making up the actual web page (HTML, CSS, and JS) for a long time, as well as the Lambda generated responses for as long as we can be sure the data is still relevant, but what happens if you hit a cold CDN cache?

    Absolute worst case for Lambda is a cold start, where the function hasn’t run recently at that PoP and has to be initialized first. The only real way around this is either keeping them warm using something along the lines of provisioned concurrency (honestly not even sure if that’s possible with Lambda@Edge – probably not; shows how much I weighed this solution EDIT: it’s definitely not a thing), or a lighter function à la CloudFront Functions (which are JavaScript-only and even more limited). While I do utilize CloudFront Functions for some things unrelated to this, they were too limited for what I needed to do.

    This didn’t seem like a tree worth barking up though, as the cold-start latency for a Lambda@Edge function seems to be at worst one second. The one function where speed really mattered (the current cost) would be called often enough by viewers that it was highly unlikely to be cold at most PoPs, and the function where it didn’t matter as much (the actual price history data) is cached for longer on the CDN anyway.

    If, however, you are asking for the static assets, worst case scenario is it would have to reach out to the S3 bucket in us-east-1 and pass the response to you while warming its cache. That’s still a negative experience; ideally the site should be loaded and making API calls in less than a second.

    This is the area I have been actively looking at solving as I write this. The most likely candidate is using an S3 Multi-Region Access Point with buckets distributed to key regions to reduce TTFB latency for cold-cache hits. This works in conjunction with CloudFront to source the static assets from a location much closer to the requester.

    The main blocker to implementing this now is figuring out how to integrate it with the build system, but that’s a problem for future me. For now, I can just increase the Max-Age of the cache to keep the assets in for longer, and utilize wildcard invalidations when I make API breaking or high priority updates (which is very rare).
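
    For reference, a wildcard invalidation is just a request body with a path such as `/*`. The sketch below only builds that body in the shape boto3’s CloudFront `create_invalidation` call expects for its `InvalidationBatch` argument. The helper name and the caller reference format are assumptions, and nothing here talks to AWS.

```python
import time


def build_invalidation_batch(paths, caller_reference=None):
    """Build an InvalidationBatch dict for CloudFront.

    `paths` may contain wildcards such as "/*". CloudFront requires a
    unique CallerReference per request; a timestamp is used when none
    is supplied.
    """
    paths = list(paths)
    return {
        "Paths": {"Quantity": len(paths), "Items": paths},
        "CallerReference": caller_reference or str(time.time()),
    }


if __name__ == "__main__":
    print(build_invalidation_batch(["/*"], "deploy-example"))
```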

    If you read this far, thank you for sticking around, and have a cookie. I’ll likely have more updates in the future detailing future engineering efforts, but as it was a year and a half since my first post, I wouldn’t expect them too often.

  • WordPress: There and Back Again

    I am horrible about settling on which blogging software to use. Over the years, I have gone from WordPress to Ghost, Jekyll to Hyde. Most recently I did a lot of work to make the jump from WordPress to Jekyll to reduce my hosting costs, only to realize the barrier to writing became higher, and that’s just not the hurdle my neurodivergent brain wanted when in a mood to write long-form.

    Love it or hate it, WordPress has put a lot into making their WYSIWYG editor actually, well, what you get. I personally like the blocks and composability of the post or page you are working on (when compared to the “classic” editor, as well as other WYSIWYG platforms like Drupal or Joomla, but that’s a discussion for another time).

    WordPress 2: The Electric Boogaloo feat. ActivityPub

    Okay so it’s not really my second install of WordPress, but I can say it’s my second one carrying the older content of this iteration (please don’t hurt me, I just wanted to use the subtitle).

    The thing that really pushed me over the edge though was the possibility to integrate WordPress with ActivityPub (via a plugin) and have it show on my timeline on Mastodon.

    While the discussion of the death of Twitter is for another post, I hope it is a catalyst to usher in an age of interoperability of various platforms under the umbrella of ActivityPub, and there are signs that it may be headed this way (beyond Mastodon’s explosive growth).

    Tumblr? I thought you were dead

    Tumblr was not a name I had on my 2022 bingo card, but Matt has been making moves. Back in 2019, Automattic acquired Tumblr from Verizon (Yahoo’s owner at the time) for a bargain. I, like many others, kinda just filed that news into the archives of my brain.

    “Huh, neat. I like Matt Mullenweg and Automattic. Hope they do good things with it.”

    – Me, 2019

    The Reports of my Death are Greatly Exaggerated

    And good things they did. Matt announced at the beginning of this month that they would be reversing the ban on nudity, and more recently that they would be adding support for ActivityPub. Obviously these were playing into the timing of the death of Bird App, but good moves nevertheless.

    Why I mention this in a post about WordPress though is I have a small inkling that they will be making ActivityPub a core part of WordPress. I hope so. This is just pure conjecture, and if they have already started work or announced they are going to, that’s pure coincidence with when I am writing this.

    I don’t mind a few plugins, but I like to keep my install lean and would prefer to use functionality built into WordPress, as there’s a support path going forward. A first-party plugin would also be acceptable, as I can understand if not everyone wants that social functionality built into their blog.

    But the Bots!

    Yep, I am fully aware of the fact that bots try all day long to get into WordPress sites. I see them in my CloudFront statistics trying to log in to sites that are static files on S3.

    Bots trying their hardest but failing

    I wanted to come up with a solution that would let me access the actual server hosting the stack without issue, while serving the public entirely through a CDN to help with global distribution and blocking access to the admin panel.

    The actual WordPress site is a Docker stack, configured through a docker-compose.yml file, completely self-contained. It’s fronted by Caddy, which handles the origin’s SSL.

    The tricky bit is “serving” the site using only the one hostname so I could access the origin directly just by setting a DNS override in my router to point at the actual origin, but not otherwise exposing the site to any strange passerby (i.e. not running it as the “default” site).

    I set Caddy to serve from the WordPress container only to the hostname – blog.emily.sh – and a second domain to serve a dummy blank page (only a single line in Caddy!) to make CloudFront happy and validate that dummy as the origin. This is so CloudFront is sending its requests to the IP that’s actually the origin, as it doesn’t accept naked IPs. You could alternatively use the EC2-generated hostname, but I apparently forgot to turn that on for the VPC subnet I deployed into, so any other domain pointing to that IP works just as well.

    An example Caddyfile, though I replaced the dummy domain with an example one so it’s not that easy to find the origin.

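    # The real site: only served when the Host header matches.
    blog.emily.sh {
        root * /var/www/html
        php_fastcgi wordpress:9000
        file_server
    }
    # Dummy site that CloudFront is pointed at as its origin domain.
    dummyhost.this.is.yours {
        respond "Hello"
    }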

    I created a Lambda@Edge function to rewrite the ‘Host’ header in flight before hitting the origin so Caddy knows to actually serve my blog instead of the dummy site. I wanted to write this as a CloudFront Function directly, but those can only be attached to Viewer Request and Viewer Response events, and I needed to rewrite the headers on the Origin Request event.

    For a sort of visual:
    Viewer Request → CloudFront → Lambda@Edge → Origin Request
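
    A minimal sketch of what such an Origin Request handler could look like in Python is below. It is not the function actually deployed: the handler name and the choice of the Python runtime are assumptions. The event layout, with headers keyed by lowercase name and each holding a list of key/value pairs, is the standard Lambda@Edge request structure.

```python
# Hostname that Caddy serves the real site under.
ORIGIN_HOST = "blog.emily.sh"


def lambda_handler(event, context):
    """Rewrite the Host header on a CloudFront Origin Request event."""
    request = event["Records"][0]["cf"]["request"]
    # Lambda@Edge headers: lowercase name -> list of {"key", "value"}.
    request["headers"]["host"] = [{"key": "Host", "value": ORIGIN_HOST}]
    # Returning the request tells CloudFront to forward it to the origin.
    return request


if __name__ == "__main__":
    sample_event = {
        "Records": [{"cf": {"request": {
            "uri": "/",
            "method": "GET",
            "headers": {
                "host": [{"key": "Host", "value": "dummyhost.this.is.yours"}],
            },
        }}}]
    }
    print(lambda_handler(sample_event, None)["headers"]["host"])
```

    CloudFront still connects to the dummy domain’s IP, but the request that arrives carries the blog’s hostname, so Caddy picks the right site block.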

    Et voilà, some obfuscation from the bots. Don’t get me wrong, it’s not perfect and it’s easy enough to work around if you know a certain IP is the origin, but it’s good enough to cut down on a lot of the chatter, and it allows me to completely bypass the CDN and still be secure. Security is about layers though, and this is just one of many (I know, I know, security through obscurity isn’t actual security, but tell me how many fewer bots you get when you don’t run SSH on 22).

    A .well-known Aside

    Requests to the .well-known route should pass through without issue, though I made a behavior for that route to not cache at all on the CDN. This way, when it comes time to renew the origin SSL cert, Caddy should have no issue, though I haven’t tested that flow yet. I do know ActivityPub correctly populates its .well-known entry though, so I do not think there will be any issue with SSL renewals.

    Likewise, I have a behavior for the admin route that simply requires any request to be signed with IAM credentials authorized to make that call on my AWS account. Since those are kinda hard to come by, I deem it an acceptable way to deal with it. I could have written a Lambda to immediately return any given status code, or direct to S3 hosting a static page, but why pay for the compute and/or storage to make that happen? If someone that’s not me has access to my AWS account, I have much bigger things to worry about than bypassing my CDN.

    There will be a tutorial on doing this for your blog soon! Just need to find the time to write it.