Tag: timestream

  • Making a Simple WoW Token Tracker Faster

    Since this is around the second anniversary for wowtoken.app, I thought I should do an update detailing a lot of the work that’s been done between when I wrote the first article and now.

    Beyond the usual dependency updates and small UI improvements, most of the big changes have happened under the hood.

    Introducing Lambda@Edge

    For a long time, I have wanted to fully utilize the power of Functions as a Service, but originally pigeonholed myself into using us-east-1. The original renditions of the data providers that generate the responses for data.wowtoken.app (the API) were written in my favorite language, Ruby.

    This was done mainly because when I was originally learning how to use Lambda, I didn’t want to have to think about specific language abstractions, and Ruby’s come most naturally to me.

    The issue is that Ruby is kind of a second-class citizen on Lambda. AWS does support it as a runtime, but you are limited to 2.7, it is only available in a handful of regions (at least, last I checked), and, most crucially, it is not available for use with Lambda@Edge.

    However, Python is a first-class citizen (along with NodeJS), so after a quick rewrite of the function into Python (and some much-needed refactoring), we were good to go.

    Now before we go any further, I should give a little background to Lambda@Edge if you are completely unfamiliar. Where Lambda is a scalable function in a single region, Lambda@Edge takes that function and runs it much closer to where your users are, at the CloudFront Points of Presence closest to them.

    There are a few advantages to this, the main one being speed in the form of much reduced latency.

    While the speed of light in fiber is fast, it takes a noticeable amount of time for a packet of information to make a trip there and back. Even across the US it can be impactful for some applications (think online games and VoIP), and that’s under the best-case scenario. Last-mile situations such as DSL and cellular can add even more overhead. All of that gets multiplied when TCP does its handshake, and again for SSL. Very quickly, what is only a 150ms ping time can become seconds of waiting for a page to load.
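
    To put rough numbers on that multiplication, here is a back-of-the-envelope sketch. It assumes one round trip for the TCP handshake, two for a full TLS 1.2 handshake (one for TLS 1.3), and one for the HTTP request itself, and it ignores server time entirely. The 20ms figure for a nearby edge location is a made-up value for comparison.

```python
# Rough time-to-first-byte estimate: every protocol step that needs a full
# round trip pays the round-trip time (RTT) again before any data arrives.

def time_to_first_byte_ms(rtt_ms: float, tls_round_trips: int = 2) -> float:
    """Estimate time to first byte for a fresh HTTPS connection.

    Assumes 1 round trip for the TCP handshake, `tls_round_trips` for the
    TLS handshake (2 for a full TLS 1.2 handshake, 1 for TLS 1.3), and
    1 more for the HTTP request/response. Server processing is ignored.
    """
    tcp_round_trips = 1
    http_round_trips = 1
    return rtt_ms * (tcp_round_trips + tls_round_trips + http_round_trips)


far_away = time_to_first_byte_ms(150)  # 150 ms ping to a distant region
nearby = time_to_first_byte_ms(20)     # hypothetical nearby edge location
print(f"distant origin: {far_away:.0f} ms, nearby edge: {nearby:.0f} ms")
```

    A page that needs several such requests in sequence stacks these delays, which is how a 150ms ping turns into seconds.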

    The solution to this (most of the time, I am not talking about the edge case of you wanting to talk to your friend directly on the opposing side of the globe), is to move your compute closer to your users.

    Lambda@Edge makes this easy, and with one click of a button you can deploy your function worldwide.

    But moving the compute closer to the end user was only half the battle. As it stood, all the data was still getting sourced from DynamoDB (and Timestream, but we will come back to it later).

    DynamoDB Global Tables

    DynamoDB had long been my backing for both the current token price and more recent historical data as its performance was more than adequate for this application. One of the neat features it has is the easy ability to set up what is essentially eventually consistent Active-Active replication.

    With one click of a button in the console, you can replicate a DynamoDB table to any region you have access to. This makes the other part of the puzzle, data locality, very easy for the >90% of requests going to it.

    With the two puzzle pieces now in our hands, we needed to combine them to make the magic happen.

    The solution I came up with for this is a simple hash map, in which I hand-mapped regions to the closest Dynamo Global Table, with a failover to the most likely-to-be-closest table if the region the function is running in isn’t in the map. It’s not elegant and I would like to do another pass at it, but it works.
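
    A minimal sketch of that kind of lookup might look like the following. The map entries, the default region, and the function names are illustrative assumptions rather than the real table. The only AWS-specific detail relied on is that the Lambda runtime sets the AWS_REGION environment variable to the region the function is executing in.

```python
import os

# Hand-written map from the region a function executes in to the nearest
# region hosting a table replica. Entries are illustrative only.
NEAREST_REPLICA = {
    "us-east-1": "us-east-1",
    "us-east-2": "us-east-1",
    "us-west-2": "us-west-2",
    "eu-west-1": "eu-west-3",            # Ireland -> Paris
    "eu-west-3": "eu-west-3",            # Paris
    "eu-north-1": "eu-north-1",          # Stockholm
    "ap-northeast-1": "ap-northeast-1",  # Tokyo
    "ap-southeast-2": "ap-southeast-2",  # Sydney
    "sa-east-1": "sa-east-1",            # Sao Paulo
}

# Assumed fallback for execution regions that are missing from the map.
DEFAULT_REPLICA = "us-east-1"


def replica_region(execution_region: str) -> str:
    """Return the replica region to query from a given execution region."""
    return NEAREST_REPLICA.get(execution_region, DEFAULT_REPLICA)


def current_replica_region() -> str:
    """Resolve the replica for wherever this function is running right now."""
    return replica_region(os.environ.get("AWS_REGION", ""))


print(replica_region("eu-west-1"))   # mapped entry
print(replica_region("af-south-1"))  # not in the map, falls back
```

    The returned region name would then be passed to whatever client talks to the table, so the lookup itself stays a pure function that is easy to test.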

    There is no real reason to run Dynamo Global Tables in more than a handful of regions unless you have specific constraints to take into consideration (the main one I can think of is wanting to reduce inter-region data transfer costs if you are moving a lot of data – this app is not).

    Currently I am utilizing several replicas in the US, as well as Paris, Sydney, Tokyo, Stockholm, and Sao Paulo. I feel like this covers most scenarios where traffic for this app would be coming from, and means the latency from Lambda → DynamoDB should never be more than 50ms for most regions, and should be much lower than that for a vast majority of my users.

    Global Timestream

    While I wish Timestream was as easy as a click of a button to make it a global service, it still has a long way to mature. This means replication will have to be handled by your application in one way or another. Otherwise, it does what it is meant to do pretty well, but with the other caveat of the long response time to a query, even for “in memory” data. So for the historical data, I took a multi-pronged approach.

    As seen from its great job responding to the current price, DynamoDB has an excellent response time – at least for the use case of a web browser requesting data – and was identified as a good candidate to be used in addition to Timestream for a hybrid approach.

    Given most people are looking at the last few days, up to a month or so, a new “historical” DynamoDB table was created to hold the previous few weeks of data. This ended up working so well, the expiry column has been updated from a few weeks, to a month, to three, and now to a full year of data (though this is still being filled).

    This means most requests will benefit from the data locality of many DynamoDB Global Table replicas, but that still leaves the older historical data, whose response size would not fit into a single DynamoDB response page and which is not stored in DynamoDB.
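
    For illustration, an expiry column like this is typically a number holding a Unix timestamp in seconds, which is the format DynamoDB's TTL feature expects. The sketch below builds such an item; the attribute names (region, timestamp, price, expire_at) are hypothetical, not the real schema.

```python
import time

SECONDS_PER_DAY = 86_400


def historical_item(region, price, now=None, retention_days=365):
    """Build a history item whose expiry lies `retention_days` in the future.

    `expire_at` is an epoch-seconds number, the format DynamoDB TTL expects.
    All attribute names here are illustrative assumptions.
    """
    now = int(time.time()) if now is None else int(now)
    return {
        "region": region,
        "timestamp": now,
        "price": price,
        "expire_at": now + retention_days * SECONDS_PER_DAY,
    }


item = historical_item("us", 250_000, now=1_700_000_000, retention_days=30)
print(item["expire_at"] - item["timestamp"])  # retention window in seconds
```

    Lengthening the retention then only means writing a larger offset for new items; older items keep whatever expiry they were written with unless they are rewritten.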

    For a long while, it had been served from a single location, a Timestream database in us-east-1. However, I was running into exceptionally poor response time for large queries that happened from locations that were not in the Americas. Multiple seconds of waiting, which makes for a very poor experience. While this was somewhat alleviated by upping the Max-Age in the cache, it was a band-aid at best, as cache misses still have to wait.

    The proper solution to this is to have multiple Timestream replicas in multiple regions. Like I alluded to in the first paragraph of this section, Timestream has no support for replication built in, so it had to be implemented on the application layer.

    Unfortunately, Timestream is (still) only available in a handful of regions, though thankfully those are fairly geographically diverse. Bringing these up to speed with the original, though, required backfilling data, which until pretty recently was not trivial on Timestream (they used to only let you backfill as far back as your “in memory” table section reached; this restriction has since been removed).

    This involved retrieving all the data from the current table (which required some clever Timestream queries to get it back in its original form, otherwise responses were too big for Timestream), turning on the EventBridge trigger for the new regions so they can start storing the latest data, and writing a simple script to slowly backfill the rest of the data.
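
    The backfill step can be sketched roughly as below. This is not the original script: it only shows the batching and pacing, with the actual write injected as a callable. The batch size of 100 reflects Timestream's per-request limit on WriteRecords; the pause length is an arbitrary assumption.

```python
import time
from typing import Callable, Iterator, List, Sequence

MAX_RECORDS_PER_WRITE = 100  # Timestream accepts at most 100 records per write


def batches(records: Sequence, size: int = MAX_RECORDS_PER_WRITE) -> Iterator[List]:
    """Yield consecutive slices of `records`, each at most `size` long."""
    for start in range(0, len(records), size):
        yield list(records[start:start + size])


def backfill(records: Sequence,
             write_batch: Callable[[List], None],
             pause_seconds: float = 0.5) -> int:
    """Write `records` in small batches, pausing between them to go slowly.

    `write_batch` is whatever performs the write against the target region.
    Returns the number of records handed to it.
    """
    written = 0
    for batch in batches(records):
        write_batch(batch)
        written += len(batch)
        time.sleep(pause_seconds)
    return written


# Dry run with a list standing in for the remote table.
sent = []
total = backfill(list(range(250)), sent.append, pause_seconds=0)
print(total, [len(batch) for batch in sent])
```

    In a real run, `write_batch` could wrap the boto3 `timestream-write` client's `write_records` call for the new region, with the database and table names supplied by the caller.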

    Once the other regions were caught up, we could now enable traffic to them, implementing a similar solution to the DynamoDB smart-routing.

    Top Speed?

    We are reaching the point of chasing tails, and by that I mean tail latencies. CloudFront caches the static assets making up the actual web page (HTML, CSS, and JS) for a long time, as well as the Lambda generated responses for as long as we can be sure the data is still relevant, but what happens if you hit a cold CDN cache?

    Absolute worst case for Lambda is a cold start, where the function hasn’t run recently at that PoP and has to be initialized first. The only real way around this is either keeping them warm using something along the lines of provisioned concurrency (honestly not even sure if that’s possible with Lambda@Edge – probably not; shows how much I weighed this solution. EDIT: it’s definitely not a thing), or a lighter function à la CloudFront Functions (which are JavaScript only and even more limited). While I do utilize CloudFront Functions for some things unrelated to this, they were too limited for what I needed to do.

    This didn’t seem like a tree worth barking up though as the cold-start latency for a Lambda@Edge function seems to be at worst one second. The one function where speed really mattered (the current cost) would be called often enough from any viewer where it was highly unlikely to be cold for most PoPs, and the function where it didn’t matter as much (the actual price history data) is cached for longer on the CDN anyways.

    If, however, you are asking for the static assets, worst case scenario is it would have to reach out to the S3 bucket in us-east-1 and pass the response to you while warming its cache. That’s still a negative experience; ideally the site should be loaded and making API calls in less than a second.

    This is the area I have been actively looking at solving as I write this. The most likely candidate is using an S3 Multi-Region Access Point with buckets distributed to key regions to reduce TTFB latency for cold-cache hits. This works in conjunction with CloudFront to source the static assets from a location much closer to the requester.

    The main blocker to implementing this now is figuring out how to integrate it with the build system, but that’s a problem for future me. For now, I can just increase the Max-Age of the cache to keep the assets in for longer, and utilize wildcard invalidations when I make API-breaking or high-priority updates (which is very rare).

    If you read this far, thank you for sticking around, and have a cookie. I’ll likely have more updates in the future detailing future engineering efforts, but as it was a year and a half since my first post, I wouldn’t expect them too often.

  • Developing a Simple WoW Token Tracker

    I have been playing World of Warcraft, the hugely popular MMORPG from Blizzard Entertainment, for years now, as I enjoy the social and team effort aspects of raiding the hardest content in the game. Having spent so long in the game, I have touched basically every corner the game has to offer. One of the main aspects of the game is the economy, driven by gold.

    Nearly everything you do in the game generates or spends gold. Being able to capitalize on any of the markets can garner you a lot of gains, and having the ability to visualize various aspects can allow you to make data-driven decisions where you’d be basically blind without it.

    A few expansions back, Blizzard implemented the WoW Token system in the game to try and help cut down on the real-money trading gold sellers that nearly all MMOs have to deal with. They decided the best way was to become the first-party market for both sides of the transaction, and they take 25% on top.

    The result of this is that people who don’t have time to grind out gold in game can spend $20 and get a fluctuating amount of gold, or vice-versa, where you can spend gold and either redeem it for game time (as normally it costs ~$15/mo to play), or for $15 in Blizzard balance that you can spend on other Blizzard properties.

    Given that they essentially created a market, the price of the token is driven by the supply and demand of each side of the equation. There are both daily fluctuations in price, and longer-term moves in the daily average. Making an informed decision on when to make a token transaction can mean a difference of tens to hundreds of thousands of gold spent or gained.

    Blizzard is one of the few game developers that offer a rich API to export information about the game world. They have lots of interesting bits of information available, and one of those is the current value of that WoW Token.

    The old site that I used to keep track of this died out of the blue. I am not sure if the domain expired, or what, but I decided this would be a good time to develop my own to my liking, and a fun way to get into using the API.

    Design

    The goal was simple. Display the current price of the token, with an arbitrary amount of history available to view. I wanted as little infrastructure as possible, because I didn’t want this site to die or to have to worry about servers for it. Given my previous experience with AWS, it was the natural choice to deploy there, and at the time they had just released their Timestream service to the public which was a perfect fit for this.

    A block diagram showing the different components of wowtoken.app.
    Visual Overview of wowtoken.app

    A time-series database won’t do anything by itself though, so AWS Lambda was chosen to drive the dynamic bits, and DynamoDB for the current price (as polling Timestream takes significantly more time compared to DynamoDB). EventBridge was chosen to trigger the Lambdas to update the price in DynamoDB and Timestream.

    The token price itself only updates a few times an hour, but the timing is fairly arbitrary, so once a minute was chosen as a poll interval for the current price, and once every 5 minutes for the Timestream record.

    I wanted to write the front end nearly from scratch, to be fairly lightweight, and just call a back end API to actually get the data, in the style of a Single Page App (SPA). For this I didn’t need anything like React or Vue, as there needed to be only a few functions to handle getting the price and the current cost. CloudFront and S3 were chosen to front the static files, and CloudFront plus Lambda and API Gateway to handle the generated API responses.

    This decision was made for a two-fold reason:
    1) While the site was still in its infancy, I was unsure as to the amount of traffic it would get, and didn’t want to pay for an EC2 instance if it wasn’t being used at all.
    2) I wanted this to be very scalable. Again, I was unsure as to the traffic patterns and didn’t want it to buckle as it got more traffic, and there was no reason it should have to hit the origin more than once a minute even under very high load.

    Lambda itself is nearly infinitely horizontally scalable, as it will spin up more instances as it gets more traffic, but having a CloudFront distribution in front of it allows responses to be cached for a configurable amount of time, in this case matching the poll rate of the upstream API. API Gateway allows for configurable routing and authorization for the API, and a graceful fallback if for whatever reason the Lambda fails to process a request.
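
    One way to tie the cache lifetime to the poll rate is for the function to set a Cache-Control header on its response, which CloudFront can honor subject to the distribution's TTL settings. The sketch below assumes an API Gateway proxy-style response (statusCode, headers, body); the payload and the helper name are placeholders, and the same effect could also be configured on the CloudFront side instead.

```python
import json

POLL_INTERVAL_SECONDS = 60  # the current price is polled once a minute


def build_response(payload, max_age=POLL_INTERVAL_SECONDS):
    """Wrap `payload` in a proxy-style response that caches for `max_age`."""
    return {
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json",
            # Let the CDN and browsers reuse this for one poll interval.
            "Cache-Control": f"public, max-age={max_age}",
        },
        "body": json.dumps(payload),
    }


def handler(event, context):
    # Placeholder payload; a real function would read the stored price.
    return build_response({"us": 250_000})


print(handler({}, None)["headers"]["Cache-Control"])
```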

    Worst-case scenario would be individual users all hitting different CloudFront servers, and those requests being forwarded to Lambda, and best case is they just hit the CloudFront server and the request doesn’t make it back to the origin.

    In theory, this means the site gets faster as more people reach it. Responses should get cached closer to the users, and the Lambdas themselves stay “warm”, meaning fewer slow cold starts of a fresh Lambda instance.

    Where possible, I use Ruby for my projects, as that’s what I enjoy writing the most. Lambda has native support for Ruby 2.5 and 2.7.

    Implementation

    The Battle.net API requires authentication, and is rate limited in the number of requests per second and per hour you can hit it. This particular application should never run into the rate limiting, but since I have other projects in the works that use the API, the first step was to develop an API gateway for my application (not to be confused with the Amazon API Gateway service).

    I wanted to stick to the Unix philosophy of do one thing and do it well, so separating the authentication and rate limiting of the API to another function was the first step. One day I will write up about that project, but it’s currently being rewritten to a point that I am comfortable releasing it to the masses. Once it is, it will be available on GitHub here under the MIT license.

    Next was developing the ingestion functions. I decided to do them separately, as the time series data did not need the same time granularity as the current price. It was there to show trends, and since the token price only updated a few times an hour, 5 minute granularity was chosen to save costs. These functions are rather simple: they take the data in one form and just write it into DynamoDB if the price is different from the current one, or into Timestream regardless of whether the price changed.

    Long term plans include analysis of the data, and in that case it’s better to have a lot of more detailed data that can be reduced down if necessary than to try and interpolate data that isn’t there.
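
    The difference between the two write paths can be sketched as follows, with a plain dict and list standing in for DynamoDB and Timestream. The function names and data shapes are hypothetical; only the write-if-changed versus always-write behavior comes from the description above.

```python
def update_current_price(current: dict, region: str, price: int) -> bool:
    """Write the current price only when it differs from the stored value.

    `current` stands in for the current-price table. Returns True if a
    write happened.
    """
    if current.get(region) == price:
        return False
    current[region] = price
    return True


def record_history(history: list, region: str, price: int, timestamp: int) -> None:
    """Append a history point unconditionally, even if the price is unchanged.

    `history` stands in for the time-series table.
    """
    history.append((timestamp, region, price))


current, history = {}, []
for ts, price in [(0, 250_000), (300, 250_000), (600, 251_000)]:
    changed = update_current_price(current, "us", price)
    record_history(history, "us", price, ts)
    print(ts, price, "current-price write" if changed else "no change")
```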

    Next was the “backend” data APIs, one to return the current price of the token from DynamoDB, and another to retrieve the last x hours from Timestream. These again are separate functions, and this was chosen so the requests could be made in parallel, as well as having different caching policies for each function. Running a Timestream query takes a bit longer than returning four keys from DynamoDB, and I wanted to be sure to render the current price as quickly as possible, whereas I was okay with the Timestream data taking a bit longer to return.

    As an aside, when initially developing for Lambda, there was a learning curve and some frustration with the fact it felt like I could only verify the results of the execution by running it on Lambda itself. Sorta like a mainframe. However, since these were rather simple functions, the easiest way was to just unit test the method itself. The testing feature within the API Gateway console itself is also useful to check that what you uploaded runs correctly.

    The main page itself was the easiest part; I had more trouble picking the colors I wanted than writing the functions that dealt with the data APIs. The biggest snag here was getting the output of my API into a format that Chart.js liked, but that was more due to me getting bogged down in the Chart.js documentation than anything else.

    Deployment is easy: as each piece is separate, I just have CodePipelines that pull from GitHub when a component is updated. If a component has no external dependencies and can be deployed as-is, such as the static front end, it’s immediately copied to S3 and that’s it. For the Lambdas that require dependencies, these are built and deployed via the Serverless Application Repository. Updates to the live site happen in less than 10 minutes in general.
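
    As a small illustration of testing the method itself, the sketch below keeps the data lookup injectable so the handler can run locally without any AWS access. The `fetch_price` parameter and the payload shape are assumptions made for this example.

```python
import json


def handler(event, context, fetch_price=lambda: {"us": 0}):
    """Lambda-style entry point; `fetch_price` is injectable for tests.

    Lambda only passes `event` and `context`, so the default is used there.
    """
    return {"statusCode": 200, "body": json.dumps(fetch_price())}


def test_handler_returns_price():
    # Replace the data source with a fixed value and check the response.
    response = handler({}, None, fetch_price=lambda: {"us": 123_456})
    assert response["statusCode"] == 200
    assert json.loads(response["body"]) == {"us": 123_456}


test_handler_returns_price()
print("handler test passed")
```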

    Things I Learned

    This project acted as a springboard to the rest of the Battle.net API. I am actively building out my next project, wowauctions.app, which is a spiritual successor to this. I learned a lot about how to build an SPA that interacts with Lambda, and this next project is being built with React.

    I had only minimally interacted with JavaScript before (basically dodged it as much as I could), so forcing myself to use it here taught me a lot about it, and helped me dive straight into React for my next project after I gathered the basics.

    It taught me a lot about the various TSDBs offered, which I already wrote about here.

    Getting things out the door is the hardest though. It’s easy to get invested for 75% of the project, only to put it down as soon as you are super close to finishing, and it was nice to see my original vision to the end.

    The Future of the Project

    There are a few things I want to add to the site, but it will take some refactoring of the Lambdas. Right now, the maximum you can look at is the previous 7 days, but that’s purely for aesthetic reasons. I want to come up with a good way to bucket the data where you can still get an idea of the daily min/max, but also the daily moving average for historical data.
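
    One possible shape for that bucketing, as a sketch only: group points by UTC calendar day, keep the min, max, and mean per day, then smooth the daily means with a trailing moving average. The input format of (unix_seconds, price) pairs and the choice of UTC day boundaries are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean


def daily_buckets(points):
    """Group (unix_seconds, price) pairs by UTC day with min/max/avg per day."""
    by_day = defaultdict(list)
    for ts, price in points:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        by_day[day].append(price)
    return [
        {"day": day, "min": min(prices), "max": max(prices), "avg": mean(prices)}
        for day, prices in sorted(by_day.items())
    ]


def moving_average(values, window):
    """Trailing moving average; early entries use as many values as exist."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


points = [(0, 100), (3_600, 300), (86_400, 200)]
buckets = daily_buckets(points)
print(buckets)
print(moving_average([b["avg"] for b in buckets], window=7))
```

    This returns one row per day regardless of how many raw points exist, so a six-month view stays at roughly 180 rows.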

    Right now, the data that gets returned from the function is basically every step or every time the price changed. Originally, it was returning all of the data, but this caused the graph to be very sharp. I wanted a smoother plot so the trend in price could be reckoned. However, if I apply this to 6 months, the amount of data looks quite bad represented in the graph as-is. I am still working on the best solution to this.
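
    Returning only the steps can be sketched as a single pass that drops any point whose price matches the previously kept one. This is an illustration of the idea, not the function as deployed, and it assumes the input is sorted by time.

```python
def price_changes(points):
    """Keep only the points where the price differs from the last kept point.

    `points` is an iterable of (timestamp, price) pairs in time order.
    """
    changes = []
    for ts, price in points:
        if not changes or changes[-1][1] != price:
            changes.append((ts, price))
    return changes


raw = [(0, 100), (300, 100), (600, 120), (900, 120), (1_200, 100)]
print(price_changes(raw))
```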

    I also want to add some light statistics about the data presented. Kind of like a stock or cryptocurrency has the current price compared to 24 hours ago.
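
    A 24-hour comparison like that could be computed from the same series. The sketch below assumes time-ordered (unix_seconds, price) pairs and non-zero prices, and uses the last point at or before the start of the window as the reference.

```python
def change_over_window(points, now, window_seconds=86_400):
    """Return (absolute_change, percent_change) versus the window start.

    `points` must be sorted by timestamp. Returns None when there is no
    point at or before `now - window_seconds` to compare against.
    """
    if not points:
        return None
    cutoff = now - window_seconds
    earlier = [price for ts, price in points if ts <= cutoff]
    if not earlier:
        return None
    previous = earlier[-1]
    current = points[-1][1]
    return current - previous, (current - previous) * 100 / previous


series = [(0, 200_000), (50_000, 220_000), (90_000, 250_000)]
print(change_over_window(series, now=90_000))
```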

    I definitely want to explore using Lambda@Edge or similar to reduce latency for end users. Even shaving milliseconds off the response time is worth it. There is no reason this setup can’t be replicated to other regions.

    Beyond that, I am undecided. The original goal was to be simple and fast. I don’t need to add unnecessary bloat. However, given it only costs a few dollars at most to run per month, there is little reason to ever let this die, at least as long as the game data APIs for World of Warcraft exist.

  • Choosing a Time Series Database for my Projects

    While purpose-built Time Series Databases (TSDBs) have been around for a while, they’ve been surging in popularity recently as people are realizing the value of data-over-time. I’ve been working on some personal projects that make heavy use of these databases, and I took some of the most popular for a spin so you can learn from my mistakes.

    Most coding projects start with research – I know what my requirements are, so how am I going to fulfill them. When this initially began, my requirements were simple. I wanted to store the value of the World of Warcraft Token in the 4 different regions the main Blizzard API serviced. This had limited amounts of updates, the value changing at most every 15 minutes.

    InfluxDB

    InfluxDB is ranked highly on Google, and to their credit, it’s exceptionally easy to get started. Install, point your browser at a host:port combo and you are off to the races.

    InfluxDB sources screen, showing various libraries and integrations
    InfluxDB sources screen
    InfluxDB example code generated for Ruby within the web interface
    Point and click for all the most popular languages and Telegraf ingestion points, I wish more tools did this

    I had a prototype of what I wanted working in less than a day, and it let me visualize the data I was storing in the Explore section with ease. However, when it came time to move this from development to initial deployment I came across the first roadblock: replication/HA, meaning making sure my app is alive even if AWS AZ USE-1b is down. The InfluxDB OSS edition offers no easy way to do this, which I guess is their way of getting you to use their paid software. There is a proxy that replicates writes available, but the implementation is up to you and there is no easy path to recovering a failed node.

    For the token project, I had to go with a different TSDB, but fast-forward a few months and I decided to revisit InfluxDB. The next project I was working on had fewer availability requirements, which meant the first roadblock wasn’t a factor. I liked the rapid prototyping and immediate feedback with the visual Data Explorer page, so I forged ahead.

    InfluxDB has a huge limitation I had seen the red flags of, but didn’t think I was going to run into. Due to the way the data is stored on disk and in memory, InfluxDB is great for data that doesn’t have high cardinality, and can suffer when there are a lot of unique data points. This is easy to see: here, here, and here are examples of where this point is stressed. The tokens, as an example, had 4 different combinations of keys for the values represented. My new project had many, many more. My lower-bound estimate for the uniquely tracked points is around 1.7 million.
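
    As a rough way to reason about cardinality, an upper bound on the number of series is the product of the distinct values of each tag. The sketch below applies that to the token case (4 regions) and to a made-up two-tag schema; the second set of numbers is purely hypothetical and is not how the 1.7 million estimate was derived.

```python
from math import prod


def max_series_cardinality(tag_value_counts: dict) -> int:
    """Upper bound on series count: product of distinct values per tag."""
    return prod(tag_value_counts.values())


print(max_series_cardinality({"region": 4}))                    # token data
print(max_series_cardinality({"server": 100, "item": 20_000}))  # hypothetical
```

    Each added tag multiplies the bound, which is why a schema that looks modest tag by tag can still end up with millions of series.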

    This is how you strangle an InfluxDB instance. Turns out, if Data Explorer tried to enumerate all the possible keys, the entire thing would lock up. systemctl stop influxdb would hang for a bit, before killing the process. I had to evaluate alternatives here too.

    InfluxDB might be great for other use cases, but for the data I was tracking, it was not the play.

    AWS Timestream

    One of Amazon’s most recently launched products (as of this post, about 6 months ago), Timestream offers the benefits of being on the cloud, such as pay-as-you-go, and serverless architecture. As far as I am aware, this is the first of the big cloud services to offer a hosted TSDB, that’s not 3rd party software (for example TimescaleDB offers a cloud service that runs their database on a public cloud).

    Other than data retention and splitting your data into Databases and Tables, the options are sparse. You get some ingestion and query graphs, and that’s about it.

    Timestream manages the rest for you, including multi-AZ replication and high availability. It’s fairly easy to get started ingesting data (though I will say the data structure is kind of funky), and if you are familiar with SQL, it’s easy to query the data too.
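
    To give a feel for that data structure, here is roughly what a single-measure record looks like when written through the API: dimensions as a list of name/value pairs, and both the measure value and the time passed as strings. The dimension name, measure name, and helper function are hypothetical choices for this example.

```python
import time


def token_record(region, price, now=None):
    """Build one single-measure Timestream record for a token price.

    The "region" dimension and "price" measure names are illustrative.
    """
    seconds = time.time() if now is None else now
    return {
        "Dimensions": [{"Name": "region", "Value": region}],
        "MeasureName": "price",
        "MeasureValue": str(price),          # values are sent as strings
        "MeasureValueType": "BIGINT",
        "Time": str(int(seconds * 1000)),    # also a string
        "TimeUnit": "MILLISECONDS",
    }


print(token_record("us", 250_000, now=1_700_000_000))
```

    A list of such dicts is what gets passed as `Records` to the write call, alongside the database and table names.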

    Amazon highly suggests that you optimize your queries heavily to return the least amount of data necessary for your operation, namely to save on costs and bandwidth. And saving costs where you can is a must, because it’s pretty expensive. Not unreasonable for what you are getting (native multi-AZ replication, configurable in-memory or disk storage, auto-scaling for queries and ingestion), but far outside of my budget for the larger data set.

    It was a perfect fit for the WoW Token website. The amount of data being stored was fairly small, and I didn’t want to host any servers for this project. I wanted it to be essentially autonomous, running on native AWS services. The database for this project costs me under a dollar a month to run, with most of the cost being my horribly un-optimized queries for what I am doing.

    However, when I attempted to use Timestream for my other project, the true costs of the service started to show. Since the AWS bill takes a few days to populate with new information, I ran it for 2 days to get an estimate of the monthly bill.

    AWS Bill showing a very high cost to using Timestream for large amounts of data

    Yikes, over $40 in 2 days. I did not want to be putting at least $500 a month towards a personal project that I don’t even know will make a cent. I was not going to be able to use Timestream for this project.

    TimescaleDB

    Right below InfluxDB in the Google results sat software called TimescaleDB. When I made my first pass over the available software, I put this as low priority because it required the most steps to get going. I had to set up a Postgres database, and then install this on top as a library. But, after realizing how much Timestream would cost, it began to look a lot more attractive.

    I deploy where I can on AWS’s Graviton2 ARM instances, both because it’s the best price/performance in general in the EC2 lineup, and because I believe ARM has a stronger future and I want to support that by not using x86 where I can. InfluxDB had prebuilt ARM packages for Ubuntu, but as of when I last installed it TimescaleDB did not. There is an open issue for those, but I had to compile it from source, which worked fine.

    Once I rewrote my workers to dump to what was essentially just a Postgres database, I was delighted to see excellent performance for the amount of data I was dumping into it. This was the choice for my larger project.

    As my workers were off to the races, I started exploring what other neat features Timescale offered. The first to catch my eye was the compression feature, which boasted possible compression ratios of over 90% for related data.

    It’s important to read the documentation carefully around the segmentby and orderby options in order to optimally compress your data. They do a far better job explaining the mechanisms behind it than I could ever hope to, but once I write up more about that project I’ll give more details about my specific usage as a real-world example.

    It was easy to add compression as a migration for Rails, but be mindful it’s easier to compress than decompress, and the schema is essentially set once it is compressed, so it’s a one-way street in Rails’ eyes. I am sure you could add up/down migrations to automatically decompress chunks, but that was far too much for what I needed to do.
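
    For reference, enabling compression boils down to two SQL statements: one that turns it on for the hypertable with the segmentby and orderby settings, and one that schedules a compression policy. The sketch below only assembles those statements as strings; the table and column names are hypothetical, and the string formatting is meant for trusted, hard-coded identifiers in a migration, not for user input.

```python
def compression_statements(table, segment_by, order_by, compress_after="7 days"):
    """Return the SQL needed to enable TimescaleDB compression on a hypertable.

    All arguments are assumed to be trusted, hard-coded identifiers.
    """
    return [
        f"ALTER TABLE {table} SET ("
        f"timescaledb.compress, "
        f"timescaledb.compress_segmentby = '{segment_by}', "
        f"timescaledb.compress_orderby = '{order_by}');",
        f"SELECT add_compression_policy('{table}', INTERVAL '{compress_after}');",
    ]


# Hypothetical hypertable "prices", segmented by item and ordered by time.
for statement in compression_statements("prices", "item_id", "time DESC"):
    print(statement)
```

    Inside a migration, statements like these would be passed to whatever raw SQL execution hook the framework provides.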

    The compression ratios they claim are not wrong. It’s neat to see my database drop significantly in disk usage every 7 days (my chosen interval) as the compression task gets run for the last week. It looks like a big saw.

    A graph of disk usage from LibreNMS showing the effects of compression.
    A see-saw of disk usage

    I have yet to reprovision the size of my EBS volume as the compression has kept it relatively small. The hot data is stored on SSDs, which while performant, also start to get expensive as you go up in size. I will probably double the size of the volume once or twice as it gets near full, but beyond that TimescaleDB has another feature to deal with less-used data.

    It has the ability to move chunks of the (hyper)table to a slower medium of storage, like spinning disks. Running statements against this hybrid table is seamless, with relevant data pulled from the slower drives where requested. This allows you to free up space on your hot data drive, while maintaining that older, still-important data. I have yet to implement it on my project, given I haven’t needed to use it and it’s still a bit of a manual process.


    There are a few other TSDBs I came across, some open source, and others not, but for the data I am handling, I am very happy with the choices I made. Even if I didn’t go with all of them, it was a learning experience nonetheless.

    Timestream is great for not dealing with servers and for the little amount of data I store on it, it’s very cheap compared to the rest of my AWS bill. But if you start to load it with a lot of data, it’s fairly expensive.

    TimescaleDB, on the other hand, requires me to maintain an Ubuntu install with all the headaches included with that, but its performance is exceptional, as is its feature set, and because it is just a layer on top of Postgres, basically any ORM can use it. I pay under $100/mo for this versus the $500+ estimated for Timestream.

    InfluxDB was easy to get started with, but falls to its knees when you are dealing with high-cardinality data. I am sure it has its place, as it seems to be very popular for monitoring systems where you don’t have 1.7 million+ possible combinations, but for my projects, it was not the right choice.