
bazaarvoice: engineering | The official blog of Bazaarvoice R&D
Preparing for the Holiday season is a year-round task for all of us here at Bazaarvoice.
This year we saw many retailers extending their seasonal in-store specials to their websites as well. We also saw retailers going as far as closing physical stores on Thanksgiving (Nordstrom, Costco, Home Depot, etc.) and Black Friday (REI).
Regardless of which of these strategies retailers took, the one common theme was an increase in online sales.
This trend is not new: online sales are catching up to in-store sales over the holiday season.
Along with the growth in online sales came an increase in demand on the Bazaarvoice network.
So here are just a few of the metrics that the Bazaarvoice network saw in the busiest week of online shopping:
Unique Visitors throughout the week
Pageviews and Impressions
So how does the Bazaarvoice team prepare for the Holiday Season?
As soon as the online traffic settles from the peak levels, the R&D team begins preparing for the next year's Holiday Season.
It starts by looking back at the numbers and how we did as a team through various retrospectives, taking inventory of what went well and what we can improve upon for the next year. Before you know it, the team gets together in June to begin preparations for the next year's efforts. I want to touch on just a few of the key areas the team focused on this past year to prepare for a successful Holiday Season:
Client communication
Disaster Recovery
Load/Performance Testing
Freeze Plan
Client Communication
One key improvement this year was client communication, both between R&D and other internal teams and externally to clients. This was identified as an area we could improve from last year.
Internally, a response communication plan was developed. This plan ensured that key representatives in R&D and support teams were on call at all times and that everyone understood the escalation paths and procedures should an issue occur. It was then the responsibility of the on-call representative to communicate any needs to the different engineers and client support staff. The on-call period lasted from November 24th to Tuesday, December 1st.
A small focused team was identified for creation and delivery of all client communication.
As early as August, “Holiday Preparedness” communications were delivered to clients informing them of our service level objectives. Monthly client communications followed, containing load target calculations, freeze plans, and disaster recovery preparations, as well as instructions on how to contact Bazaarvoice in the event of an issue and how we would communicate the current status of our network during this critical holiday season.
Internally, there was also an increased emphasis on the creation and evaluation of runbooks. Runbooks are 'play-by-play' instructions that engineers should carry out for different scenarios. This collection of procedures and operations was vital in the team's disaster recovery planning.
Disaster Recovery
To improve our operational excellence, we needed to ensure our teams were conducting exploratory disaster scenario testing to know for certain how our apps and services behaved, and to improve our DevOps code, monitoring/alerting, runbooks, etc.
Documenting the procedures was completed in late summer, and that quickly moved into evaluating our assumptions and correcting where necessary.
All teams were responsible for:
documenting the test plan
documenting the results
capturing the MTTR (mean time to recovery) where appropriate
Sign-off was required for all test plans, and results were shared amongst the teams.
We also executed a full set of Disaster Recovery scenarios and performed additional Green Flag fire drills to ensure all systems and personnel were prepared for any contingencies during the holiday season.
Load/Performance Testing
For an added layer of insurance, we pre-scaled our environment ahead of the anticipated holiday load profile.
Analysis of 3 years of previous holiday traffic showed a predictable increase of approximately 2.5x the highest load average over the past 10 months. For this holiday season we tested at 4x the highest load average over that period to ensure we were covered. The load tests allowed the team to push beyond the expected traffic profile and confirm that all systems would perform above expected levels.
Load testing was initially isolated to each system. Conducting tests in such an environment helped quickly identify any failure points. As satisfactory results were obtained, complexity was introduced by running systems in tandem, simulating an environment more representative of what would be encountered during the holiday season.
One benefit of this testing was the identification and evaluation of other key metrics for confirming that the systems are operating and performing successfully. A predictive model was also created to evaluate our expected results.
The accuracy of the daily model was within 5% of the expected daily results and, overall for the 2015 season, within 3%. This new model will be an essential tool when preparing for the next holiday season.
Freeze Plan
Once again, we locked down the code prior to the holiday season. Limiting the number of 'moving parts' and thoroughly testing the code in place increased our confidence that we would not experience any major issues.
As the image below demonstrates, two critical time periods were identified:
critical change freeze – code changes introduced only if sites were down.
general change freeze – priority-one bug fixes accepted, with additional risk assessments performed on all changes.
As you can see, the critical times coincide with the times we see increased online traffic.
A substantial amount of the work was completed in the months prior to Black Friday and Cyber Monday. The team's coordinated efforts prior to the holiday season ensured that our clients' online operations ran smoothly.
Over half of the year was spent ensuring performance and scalability for these critical times in the holiday season.
Data from as far back as three years was also used to forecast web traffic and ensure we would scale appropriately. This metric-driven perspective also provided new, insightful models to be used in future years' forecasts.
The preparation paid off, and Bazaarvoice was able to handle 8.398 billion impressions from Black Friday through Cyber Monday (11/27–11/30), a new record for our network.
Every year Bazaarvoice R&D throws BVIO, an internal technical conference followed by a two-day hackathon. These conferences are an opportunity for us to focus on unlocking the power of our network, data, APIs, and platforms, as well as have some fun in the process. We invite keynote speakers from within BV, from companies who use our data in inspiring ways, and from companies who are successfully using big data to solve cool problems. After a full day of learning we engage in an intense, two-day hackathon to create new applications, visualizations, and insights into our extensive data.
Continue reading for pictures of the event and videos of the presentations.
This year we held the conference at the palatial Omni Barton Creek Resort in one of their well-appointed ballrooms.
Participants arrived around 9am (some of us a little later). After breakfast, provided by Bazaarvoice, we got started with the speakers followed by lunch, also provided by Bazaarvoice, followed by more speakers.
After the speakers came a “pitchfest” during which our Product team presented hackathon ideas and participants started forming teams and brainstorming.
Finally it was time for 48 hours of hacking, eating, and gaming (not necessarily in that order) culminating in project presentations and prizes.
Presentations
Sephora: Consumer Targeted Content
Director of Architecture & DevOps @ Sephora
Venkat presented on the work Sephora is doing around serving relevant, targeted content to their consumers in both the mobile and in-store space. It was a fascinating talk, and we love to see how our clients are innovating with us. Unfortunately, due to technical difficulties, we don't have a recording.
Philosophy & Design of The BV System of Record
John Roesler & Fahd Siddiqui
Bazaarvoice Engineers
This talk was about the overarching design of Bazaarvoice’s innovative data architecture. According to them there are aspects to it that may seem unexpected at first glance (especially not coming from a big data background), but are actually surprisingly powerful. The first innovation is the separation of storage and query, and the second is choosing a knowledge-base-inspired data model. By making these two choices, we guarantee that our data infrastructure will be robust and durable.
Co-Founder and CTO at OneSpot
Ian has built and manages a team of world-class software engineers and data scientists at OneSpot™. In his presentation he discusses how he applied machine learning and game theory to architect a sophisticated real-time bidding engine for OneSpot™ capable of predicting the behavior of tens of thousands of people per second.
New Amazon Machine Learning and Lambda architectures
Amazon Solutions Architect
In his presentation Jeff discusses the history of Amazon Machine Learning and the Lambda architecture, how Amazon uses them, and how you can use them too. He then walks us through the AWS UI for building and training a model.
Thanks to Sharon Hasting, Dan Heberden, and the presenters for contributing to this post.
When I started on the Firebird team at Bazaarvoice, I was happy to learn that they host their code on GitHub and review and land changes via pull requests. I was less happy to learn that they merged pull requests with the big green button. I was able to convince the team to try out a new, rebase-oriented workflow that keeps the mainline branch linear and clean. While the new workflow was a hit with the team, it was much more complicated than just clicking a button, so I automated the workflow with a simple git extension, which we have released as an open source tool.
What’s Wrong With the Big Green Button?
The big green button is the “Merge pull request” button that GitHub provides to merge pull requests. Clicking it prompts the user to enter a commit message (or accept a default provided by GitHub) and then confirm the merge. When the user confirms the merge, the pull request branch is merged using the --no-ff option, which always creates a merge commit. Finally, GitHub closes the pull request.
For example, given a master branch like this:
…and a feature branch that diverges from the second commit:
…this is the result of doing a --no-ff merge:
Merging with the big green button is problematic; for detailed discussions of why this is, see the posts by Isaac and Benjamin. In addition to the problems with merge commits that they point out, the big green button has another downside: it merges the pull request without an opportunity to squash commits or otherwise clean up the branch.
This causes a couple of problems. First, because only the pull request author can clean up the PR branch, merging often became a tedious and drawn-out process as reviewers cajoled the author into updating their branch to a state that would keep `master`'s history relatively clean. Worse, sometimes messy pull requests were hastily or mistakenly merged.
As a result, the team was encouraged to keep their pull requests squashed into one or two clean commits at all times. This solved one problem, but introduced another: when an author responds to comments by pushing up a new version of the pull request, the latest changes are squashed together into one or two commits. As a result, reviewers had to hunt through the entire diff to ensure that their comments were fully addressed.
An Alternate Workflow
After some lively discussion, the team adopted a new workflow centered on fast-forward merging squashed and rebased pull request branches. Developers create topic branches and pull requests as before, but when updating their pull request, they never squash commits. This preserves detailed history of the changes the author makes in response to review feedback.
When the PR is ready to be merged, the merger interactively rebases it on the latest master, squashes it down to one or two commits, and does a fast-forward merge. The result is a clean, linear, and atomic history for `master`.
One hiccup is that GitHub can't easily tell that the rebased and squashed commit contains the changes in the pull request, so it doesn't close the PR automatically. Fortunately, GitHub will close a pull request when a commit referencing it with a closing keyword lands on the default branch. So, the merger has a final task: adding “[closes #<PR number>]” to one of the squashed commits' messages.
The biggest downside to the new workflow is that it transformed merging a PR from a simple operation (pushing a button) to a somewhat tricky multi-step process:
update local master to latest from upstream
check out pull request branch
do an interactive rebase on top of master, squashing down to one or two commits
add “[closes #<PR number>]” to the message of the most recent squashed commit
do a fast-forward merge of the pull request branch into master
push local master to upstream
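Sketched in code, the scripted portion of those steps looks roughly like the following. This is purely an illustration and not the released tool (which is a git extension); the interactive squash in step 3 and the commit-message edit in step 4 are left to the merger, and the branch-name handling is hypothetical:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SquashMergeSketch {

    // Run a git command, streaming its output, and fail loudly on a non-zero exit.
    static void git(String... args) throws IOException, InterruptedException {
        List<String> command = new ArrayList<>();
        command.add("git");
        command.addAll(Arrays.asList(args));
        Process process = new ProcessBuilder(command).inheritIO().start();
        if (process.waitFor() != 0) {
            throw new IllegalStateException("git " + String.join(" ", args) + " failed");
        }
    }

    public static void main(String[] args) throws Exception {
        String prBranch = args[0];                     // e.g. "feature/my-change"
        git("checkout", "master");
        git("pull", "--ff-only", "origin", "master");  // 1. update local master from upstream
        git("checkout", prBranch);                     // 2. check out the pull request branch
        git("rebase", "master");                       // 3. rebase (the squash itself is interactive)
        // 4. amending the commit message with "[closes #<PR number>]" is left to the merger
        git("checkout", "master");
        git("merge", "--ff-only", prBranch);           // 5. fast-forward merge into master
        git("push", "origin", "master");               // 6. push local master to upstream
    }
}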
This process was too lengthy and error-prone to be reliable unless automated. To address this problem, I created a simple git extension. The Firebird team has been using this tool for a little over a year with very few problems. In fact, it has spread to other teams at Bazaarvoice. We are excited to release it as an open source tool for the public to use.
One of the chief promises of the cloud is fast scalability, but what good is snappy scalability without load prediction to match? How many teams out there are still manually switching group sizes when load spikes? If you would like to make your Amazon EC2 scaling more predictive, less reactive, and hopefully less expensive, this article is intended to help.
Problem 1: AWS EC2 Autoscaling Groups can only scale in response to metrics in CloudWatch and most of the default metrics are not sufficient for predictive scaling.
For instance, by looking at the CloudWatch metrics reference page, we can see that Amazon SQS queues, EC2 instances, and many other Amazon services post metrics to CloudWatch by default.
From SQS you get things like NumberOfMessagesSent and SentMessageSize. EC2 Instances post metrics like CPUUtilization and DiskReadOps. These metrics are helpful for monitoring. You could also use them to reactively scale your service.
The downside is that by the time you notice that you are using too much CPU or sending too few messages, you’re often too late. EC2 instances take time to start up and instances are billed by the hour, so you’re either starting to get a backlog of work while starting up or you might shut down too late to take advantage of an approaching hour boundary and get charged for a mostly unused instance hour.
More predictive scaling would start up the instances before the load became business critical or it would shut down instances when it becomes clear they are not going to be needed instead of when their workload drops to zero.
Problem 2: AWS CloudWatch default metrics are only published every 5 minutes.
In five minutes a lot can happen; with more granular metrics you could learn about your scaling needs quite a bit faster. Our team has instances that take about 10 minutes to come online, so 5 minutes can make a lot of difference to our responsiveness to changing load.
Solution 1 & 2: Publish your own CloudWatch metrics
Custom metrics can overcome both of these limitations: you can publish metrics related to your service's needs, and you can publish them much more often.
For example, one of our services runs on EC2 instances and processes messages off an SQS queue. The load profile is uneven: some messages can be handled very quickly, and some take significantly more time. It's not sufficient to simply look at the number of messages in the queue, as the average processing speed can vary between 2 and 60 messages per second depending on the data.
We prefer that all our messages be handled within 2 hours of being received. With this in mind I’ll describe the metric we publish to easily scale our EC2 instances.
ApproximateSecondsToCompleteQueue = MessagesInQueue / AverageMessageProcessRate
The metric we publish is called ApproximateSecondsToCompleteQueue. A scheduled executor on our primary instance runs every 15 seconds to calculate and publish it.
import com.amazonaws.regions.RegionUtils;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.MetricDatum;
import com.amazonaws.services.cloudwatch.model.PutMetricDataRequest;
import com.amazonaws.services.cloudwatch.model.StandardUnit;

private AmazonCloudWatchClient _cloudWatchClient = new AmazonCloudWatchClient();
_cloudWatchClient.setRegion(RegionUtils.getRegion("us-east-1"));

// Build a datum for ApproximateSecondsToCompleteQueue, dimensioned by queue name, and publish it.
PutMetricDataRequest request = new PutMetricDataRequest()
        .withNamespace(CUSTOM_SQS_NAMESPACE)
        .withMetricData(new MetricDatum()
                .withMetricName("ApproximateSecondsToCompleteQueue")
                .withDimensions(new Dimension()
                        .withName(DIMENSION_NAME)
                        .withValue(_queueName))
                .withUnit(StandardUnit.Seconds)
                .withValue(approximateSecondsToCompleteQueue));
_cloudWatchClient.putMetricData(request);
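For context, here is a minimal sketch of the scheduled calculation that feeds the publish call above. The queue-depth and processing-rate helpers are hypothetical placeholders, not code from our service:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class QueueMetricScheduler {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Every 15 seconds, estimate how long the queue will take to drain
    // and hand the value to the CloudWatch publishing code above.
    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            double messagesInQueue = getApproximateMessagesInQueue();   // e.g. from SQS queue attributes
            double averageRate = getAverageMessageProcessRate();        // messages per second, app-specific
            double approximateSecondsToCompleteQueue =
                    averageRate > 0 ? messagesInQueue / averageRate : 0;
            publishToCloudWatch(approximateSecondsToCompleteQueue);
        }, 0, 15, TimeUnit.SECONDS);
    }

    // Hypothetical helpers, not part of the original post:
    private double getApproximateMessagesInQueue() { return 0; }
    private double getAverageMessageProcessRate() { return 1; }
    private void publishToCloudWatch(double value) { }
}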
In our CloudFormation template we have a parameter called DesiredSecondsToCompleteQueue, which by default is set to 2 hours (7200 seconds). In the Auto Scaling Group we have a scale-up action triggered by an alarm that fires when ApproximateSecondsToCompleteQueue exceeds DesiredSecondsToCompleteQueue.
"EstimatedQueueCompleteTime" : {
"Type": "AWS::CloudWatch::Alarm",
"Condition": "HasScaleUp",
"Properties": {
"Namespace": "Custom/Namespace",
"Dimensions": [{
"Name": "QueueName",
"Value": { "Fn::Join" : [ "", [ {"Ref": "Universe"}, "-event-queue" ] ] }
"MetricName": "ApproximateSecondsToCompleteQueue",
"Statistic": "Average",
"ComparisonOperator": "GreaterThanThreshold",
"Threshold": {"Ref": "DesiredSecondsToCompleteQueue"},
"Period": "60",
"EvaluationPeriods": "1",
"AlarmActions" : [{
"Ref": "ScaleUpAction"
Visualizing the Outcome
What’s a cloud blog without some graphs? Here’s what our load and scaling looks like after implementing this custom metric and scaling. Each of the colors in the middle graph represents a service instance. The bottom graph is in minutes for readability. Note that our instances terminate themselves when there is nothing left to do.
I hope this blog has shown you that it’s quite easy to publish your own CloudWatch metrics and scale your EC2 AutoScalingGroups accordingly.
At Bazaarvoice we use Dropwizard for a lot of our Java-based SOA services. Recently I upgraded our Dropwizard dependency from 0.6 to the newer 0.7 version on a few different services. Based on this experience, I have some observations that might help other developers attempting to do the same thing.
Package Name Change
The first change to look at is the new package naming. The new io.dropwizard package replaces com.yammer.dropwizard. If you are using codahale’s metrics library as well, you’ll need to change com.yammer.metrics to com.codahale.metrics. I found that this was a good place to start the migration: if you remove the old dependencies from your pom.xml you can start to track down all the places in your code that will need attention (if you’re using a sufficiently nosy IDE).
- com.yammer.dropwizard -> io.dropwizard
- com.yammer.dropwizard.config -> io.dropwizard.setup
- com.yammer.metrics -> com.codahale.metrics
Class Name Change
aka: where did my Services go?
Something you may notice quickly is that the Service class is gone; it has been renamed to Application.
- Service -> Application
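To make the rename concrete, here is a minimal before-and-after sketch. The ExampleConfiguration class and the class names are hypothetical, not taken from an actual service:

// Dropwizard 0.6: extend Service from the com.yammer packages.
public class ExampleService extends com.yammer.dropwizard.Service<ExampleConfiguration> {
    @Override
    public void initialize(com.yammer.dropwizard.config.Bootstrap<ExampleConfiguration> bootstrap) { }

    @Override
    public void run(ExampleConfiguration configuration, com.yammer.dropwizard.config.Environment environment) { }
}

// Dropwizard 0.7: the same class now extends Application from io.dropwizard.
public class ExampleApplication extends io.dropwizard.Application<ExampleConfiguration> {
    @Override
    public void initialize(io.dropwizard.setup.Bootstrap<ExampleConfiguration> bootstrap) { }

    @Override
    public void run(ExampleConfiguration configuration, io.dropwizard.setup.Environment environment) { }
}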
Configuration Changes
The Configuration object hierarchy and YAML organization have also changed. The http section in YAML has moved to server, with significant differences in how it works.
Here’s an old http configuration:
http:
  port: 8080
  adminPort: 8081
  connectorType: NONBLOCKING
  requestLog:
    console:
      enabled: true
    file:
      enabled: true
      archive: false
      currentLogFilename: target/request.log
and here is a new server configuration:
server:
  applicationConnectors:
    - type: http
      port: 8080
  adminConnectors:
    - type: http
      port: 8081
  requestLog:
    appenders:
      - type: console
      - type: file
        currentLogFilename: target/request.log
        archive: true
There are at least two major things to notice here:
You can create multiple connectors for either the admin or application context. You can now serve several different protocols on different ports.
Logging is now appender based, and you can configure a list of appenders for the request log.
Speaking of appender-based logging, the logging configuration has changed as well.
Here is an old logging configuration:
logging:
  console:
    enabled: true
  file:
    enabled: true
    archive: false
    currentLogFilename: target/diagnostic.log
  level: INFO
  loggers:
    "org.apache.zookeeper": WARN
    "com.sun.jersey.spi.container.servlet.WebComponent": ERROR
and here is a new one:
logging:
  level: INFO
  loggers:
    "org.apache.zookeeper": WARN
    "com.sun.jersey.spi.container.servlet.WebComponent": ERROR
  appenders:
    - type: console
    - type: file
      archive: false
      currentLogFilename: target/diagnostic.log
Now that you can configure a list of logback appenders, you can write your own or get one from a library. Previously this kind of logging configuration was not possible without significant hacking.
Environment Changes
The whole environment API has been redesigned for more logical access to different components. Rather than just making calls to methods on the environment object, there are now six component-specific environment objects to access.
JerseyEnvironment jersey = environment.jersey();
ServletEnvironment servlets = environment.servlets();
AdminEnvironment admin = environment.admin();
LifecycleEnvironment lifecycle = environment.lifecycle();
MetricRegistry metrics = environment.metrics();
HealthCheckRegistry healthCheckRegistry = environment.healthChecks();
AdminEnvironment extends ServletEnvironment since it’s just the admin servlet context.
By treating the environment as a collection of libraries rather than a Dropwizard monolith, fine-grained control over several configurations is now possible and the underlying components are easier to interact with.
Here is a short rundown of the changes:
Lifecycle Environment
Several common methods were moved to the lifecycle environment, and the builder pattern for executor services has changed.
// Dropwizard 0.6:
environment.manage(uselessManaged);
environment.addServerLifecycleListener(uselessListener);
ExecutorService service = environment.managedExecutorService("worker-%", minPoolSize, maxPoolSize, keepAliveTime, duration);
ExecutorServiceManager esm = new ExecutorServiceManager(service, shutdownPeriod, unit, poolname);
ScheduledExecutorService scheduledService = environment.managedScheduledExecutorService("scheduled-worker-%", corePoolSize);

// Dropwizard 0.7:
environment.lifecycle().manage(uselessManaged);
environment.lifecycle().addServerLifecycleListener(uselessListener);
ExecutorService service = environment.lifecycle().executorService("worker-%")
        .minThreads(minPoolSize)
        .maxThreads(maxPoolSize)
        .keepAliveTime(Duration.minutes(keepAliveTime))
        .build();
ExecutorServiceManager esm = new ExecutorServiceManager(service, Duration.seconds(shutdownPeriod), poolname);
ScheduledExecutorService scheduledExecutorService = environment.lifecycle().scheduledExecutorService("scheduled-worker-%")
        .threads(corePoolSize)
        .build();
Other Miscellaneous Environment Changes
Here are a few more common environment configuration methods that have changed:
// Dropwizard 0.6:
environment.addResource(Dropwizard6Resource.class);
environment.addHealthCheck(new DeadlockHealthCheck());
environment.addFilter(new LoggerContextFilter(), "/loggedpath");
environment.addServlet(PingServlet.class, "/ping");

// Dropwizard 0.7:
environment.jersey().register(Dropwizard7Resource.class);
environment.healthChecks().register("deadlock-healthcheck", new ThreadDeadlockHealthCheck());
environment.servlets().addFilter("loggedContextFilter", new LoggerContextFilter()).addMappingForUrlPatterns(EnumSet.allOf(DispatcherType.class), true, "/loggedpath");
environment.servlets().addServlet("ping", PingServlet.class).addMapping("/ping");
Object Mapper Access
It can be useful to access the objectMapper for configuration and testing purposes.
// Dropwizard 0.6:
ObjectMapper objectMapper = bootstrap.getObjectMapperFactory().build();
// Dropwizard 0.7:
ObjectMapper objectMapper = bootstrap.getObjectMapper();
HttpConfiguration
This has changed a lot; it is much more configurable and not quite as simple as before.
// Dropwizard 0.6:
HttpConfiguration httpConfiguration = configuration.getHttpConfiguration();
int applicationPort = httpConfiguration.getPort();

// Dropwizard 0.7:
HttpConnectorFactory httpConnectorFactory = (HttpConnectorFactory) ((DefaultServerFactory) configuration.getServerFactory()).getApplicationConnectors().get(0);
int applicationPort = httpConnectorFactory.getPort();
Test Changes
The functionality provided by extending ResourceTest has been moved to ResourceTestRule.
// Dropwizard 0.6:
import com.yammer.dropwizard.testing.ResourceTest;

public class Dropwizard6ServiceResourceTest extends ResourceTest {
    protected void setUpResources() throws Exception {
        addResource(Dropwizard6Resource.class);
        addFeature("booleanFeature", false);
        addProperty("integerProperty", new Integer(1));
        addProvider(HelpfulServiceProvider.class);
    }
}

// Dropwizard 0.7:
import io.dropwizard.testing.junit.ResourceTestRule;
import org.junit.Rule;

public class Dropwizard7ServiceResourceTest {
    @Rule
    public ResourceTestRule resources = setUpResources();

    protected ResourceTestRule setUpResources() {
        return ResourceTestRule.builder()
                .addResource(Dropwizard6Resource.class)
                .addFeature("booleanFeature", false)
                .addProperty("integerProperty", new Integer(1))
                .addProvider(HelpfulServiceProvider.class)
                .build();
    }
}
Dependency Changes
Dropwizard 0.7 has new dependencies that might affect your project. I’ll go over some of the big ones that I ran into during my migrations.
Guava 18.0 has a few API changes:
Closeables.closeQuietly only works on objects implementing InputStream instead of anything implementing Closeable.
All the methods on HashCodes have been migrated to HashCode.
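For example, code along these lines needs updating when you pick up Guava 18; this is a hypothetical usage sketch, not code from our services:

import com.google.common.hash.HashCode;
import com.google.common.io.Closeables;

import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class GuavaMigrationExample {
    public static void main(String[] args) {
        // closeQuietly now expects an InputStream (or Reader) rather than any Closeable.
        InputStream in = new ByteArrayInputStream(new byte[0]);
        Closeables.closeQuietly(in);

        // Static factories that used to live on HashCodes are now on HashCode.
        HashCode hash = HashCode.fromLong(42L);
        System.out.println(hash);
    }
}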
Metrics 3.0.2 is a pretty big revision of the old version; there is no longer a static Metrics object available as the default registry. Now MetricRegistries are instantiated objects that need to be managed by your application. Dropwizard 0.7 handles this by giving you a place to put the default registry for your application: bootstrap.getMetricRegistry().
Compatible library version changes
These libraries changed versions but required no other code changes. Some of them are changed to match Dropwizard dependencies, but are not directly used in Dropwizard.
Coursera Metrics-Datadog
Apache Curator
Amazon AWS SDK
Future Concerns
Dropwizard 0.8
The newest version of Dropwizard is now 0.8; once it has proven stable, we'll start migrating. Hopefully I'll find time to write another post when that happens.
Thank You For Reading
I hope this article helps.
Today we are announcing two important changes to our Conversations API services:
Deprecation of Conversations API versions older than 5.2 (4.9, 5.0, 5.1)
Ending Conversations API service using custom domains
Both of these changes will go into effect on April 30, 2016.
Our newer APIs and universal domain system offer you important advantages in both features and performance. In order to best serve our customers, Bazaarvoice is focusing its API efforts on the latest, highest performing API services. By deprecating older versions, we can refocus our energies on the current and future API services, which we feel offer the most benefits to our customers. Please visit our developer documentation to learn more about the Conversations API, our API versioning, and the steps necessary to support the upgrade.
We understand that this news may be surprising. This is your first notification of this change. In the months and weeks ahead, we will continue to remind you that this change is coming.
We also understand that this change will require effort on your part. Bazaarvoice is committed to making this transition easy for you. We are prepared to assist you in a number of ways:
Pre-notification: You have 12 months to plan for and implement the change.
Documentation: We have specific documentation to help you through the upgrade.
Support: Our support team is ready to address any questions you may have.
Services: Our services teams are available to provide additional assistance.
In summary, on April 30, 2016, Conversations API versions released before 5.2 will no longer be available. Applications and websites using versions before 5.2 will no longer function properly after April 30, 2016. In addition, all Conversations API calls, regardless of version, made to a custom domain will no longer respond. Applications and websites using custom domains (such as “ReviewStarsInc.”) will no longer function properly after April 30, 2016. If your application or website is making API calls to Conversations API versions 4.9, 5.0, or 5.1, you will need to upgrade to the current Conversations API (5.4) and use the universal domain. Applications using Conversations API versions 5.2 and later (5.2, 5.3, 5.4) with the universal domain will continue to receive uninterrupted API service.
If you have any questions about this notice, please submit a support case. We will periodically update this blog and our developer Twitter feed as we move closer to the change-of-service date.
Thank you for your partnership,
Chris Kauffman
Sr. Product Manager
A distributed data system consisting of several nodes is said to be fully consistent when all nodes have the same state of the data they own. So, if record A is in State S on one node, then we know that it is in the same state in all its replicas and data centers.
Full consistency sounds great. The catch is the CAP theorem, which states that it's impossible for a distributed system to simultaneously guarantee consistency (C), availability (A), and partition tolerance (P). At Bazaarvoice, we have sacrificed full consistency to get an AP system and contend with an eventually consistent data store. One way to define eventual consistency is that there is a point in time in the past before which the system is fully consistent (the full consistency timestamp, or FCT). The duration between the FCT and now is called the Full Consistency Lag (FCL).
An eventually consistent system may never be in a fully consistent state given a massive write throughput. However, what we really want to know deterministically is the last time before which we can be assured that all updates were fully consistent on all nodes. So, in the figure above, in the inconsistent state, we would like to know that everything up to Δ2 has been replicated fully, and is fully consistent. Before we get down to the nitty-gritty of this metric, I would like to take a detour to set up the context of why it is so important for us to know the full consistency lag of our distributed system.
At Bazaarvoice, we employ an eventually consistent system of record that is designed to span multiple data centers, using multi-master conflict resolution. It relies on Apache Cassandra for persistence and cross-data-center replication.
One of the salient properties of our system of record is immutable updates. That essentially means that a row in our data store is simply a sequence of immutable updates, or deltas. A delta can be a creation of a new document, an addition, modification, or removal of a property on the document, or even a deletion of the entire document. For example, a document is stored in the following manner in Cassandra, where each delta is a column of the row.
Δ1 { “rating”: 4,
“text”: “I like it.”}
Δ2 { .., “status”: “APPROVED” }
Δ3 { .., “client”: “Walmart” }
So, when a document is requested, the reader process resolves all the above deltas (Δ1 + Δ2 + Δ3) in that order, and produces the following document:
{ “rating”: 4,
“text”: “I like it.”,
“status”: “APPROVED”,
“client”: “Walmart”,
“~version”: 3 }
Note that these deltas are stored as key/value pairs with the key as a time UUID. Cassandra thus always presents them in increasing order of insertion, ensuring the last-write-wins property. Storing the rows in this manner allows us massive non-blocking global writes. Writes to the same row from different data centers across the globe will eventually reach a consistent state without making any cross-data-center calls. This point alone warrants a separate blog post, but it will have to suffice for now.
To recap, rows are nothing but a sequence of deltas. Writers simply append these deltas to the row, without caring about the existing state of the row. When a row is read, these deltas are resolved in ascending order and produce a json document.
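As a rough illustration of that read path, here is a minimal sketch of resolving deltas into a document. It treats each delta as a map of property changes and ignores deletes and compaction records, so it is only a sketch of the idea rather than the actual resolver:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DeltaResolver {

    // Deltas arrive already sorted by their time UUID keys, so applying them
    // in order gives last-write-wins semantics for each property.
    static Map<String, Object> resolve(List<Map<String, Object>> deltasInTimeOrder) {
        Map<String, Object> document = new LinkedHashMap<>();
        int version = 0;
        for (Map<String, Object> delta : deltasInTimeOrder) {
            document.putAll(delta);
            version++;
        }
        document.put("~version", version);
        return document;
    }
}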
There is one problem with this: over time rows will accrue a lot of updates causing the row to become really wide. The writes will still be OK, but the reads can become too slow as the system tries to consolidate all those deltas into one document. This is where compaction helps. As the name suggests, compaction resolves several deltas, and replaces them with one “compacted” delta. Any subsequent reads will only see a compaction record, and the read slowness issue is resolved.
Great. However, there is a major challenge that comes with compaction in a multi-datacenter cluster. When is it ok to compact rows on a local node in a data center? Specifically, what if an older delta arrives after we are done compacting? If we arbitrarily decide to compact rows every five minutes, then we run the risk of losing deltas that may be in flight from a different data center.
To solve this issue, we need to figure out what deltas are fully consistent on all nodes and only compact those deltas, which basically is to say, “Find time (t) in the past, before which all deltas are available on all nodes”. This t, or full consistency timestamp, assures us that no deltas will ever arrive with a time UUID before this timestamp. Thus, everything before the full consistency timestamp can be compacted without any fear of data loss.
There is just one issue: this metric is absent in out-of-the-box AP systems such as Cassandra. To me, this is a vital metric for an AP system. It would be rare to find a business use case in which permanent inconsistency is tolerable.
Although Cassandra doesn’t provide the full consistency lag, we can still compute it in the following way:
Tf = the time at which no hints were found on any node
rpc_timeout = the maximum timeout in Cassandra that nodes use when communicating with each other
FCT = Full Consistency Timestamp
FCL = Full Consistency Lag
FCT = Tf - rpc_timeout
FCL = Tnow - FCT
The concept of Hinted Handoffs was introduced in Amazon’s dynamo paper as a way of handling failure. This is what Cassandra leverages for fault-tolerant replication. Basically, if a write is made to a replica node that is down, then Cassandra will write a “hint” to the coordinator node and try again in a configured amount of time.
We exploit this feature of Cassandra to get our full consistency lag. The main idea is to poll all the nodes to see if they have any pending hints for other nodes. The time when they all report zero (Tf) is when we know that there are no failed writes and that the only pending writes are those still in flight. Subtracting the Cassandra timeout (rpc_timeout) from Tf then gives us our full consistency timestamp, and from that, the lag.
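Here is a minimal sketch of that calculation. The HintPoller interface is a hypothetical stand-in for however you gather pending-hint counts from each node (for example, via Cassandra's JMX metrics); the formulas are the ones given above:

import java.util.Map;

public class FullConsistencyLag {

    // Hypothetical poller: pending hint counts per node, e.g. gathered via JMX.
    public interface HintPoller {
        Map<String, Long> pendingHintsByNode();
    }

    private final HintPoller poller;
    private final long rpcTimeoutMillis;
    private volatile long lastHintFreeMillis = System.currentTimeMillis();  // Tf

    public FullConsistencyLag(HintPoller poller, long rpcTimeoutMillis) {
        this.poller = poller;
        this.rpcTimeoutMillis = rpcTimeoutMillis;
    }

    // Call periodically (we poll every 5 minutes). Returns the current lag in milliseconds.
    public long computeLagMillis() {
        boolean noHintsAnywhere = poller.pendingHintsByNode().values().stream().allMatch(c -> c == 0);
        long now = System.currentTimeMillis();
        if (noHintsAnywhere) {
            lastHintFreeMillis = now;                                           // update Tf
        }
        long fullConsistencyTimestamp = lastHintFreeMillis - rpcTimeoutMillis;  // FCT = Tf - rpc_timeout
        return now - fullConsistencyTimestamp;                                  // FCL = Tnow - FCT
    }
}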
Now that we have our full consistency lag, this metric can be used to alert the appropriate people when the cluster is lagging too far behind.
Finally, you would want to graph this metric for monitoring.
Note that in the above graph we artificially added a 5 minute lag to our rpc_timeout value to avoid excessively frequent compactions. We periodically poll for full consistency every 300 seconds (or 5 minutes). You should tweak this value according to your needs. For our settings above, the expected lag is 5 minutes, but you can see it spike at 10 minutes. All that really says is there was one time when we checked and found a few hints. The next time we checked (after 5 minutes in our case) all hints were taken care of. You can now set an alert in your system that should wake people up if this lag violates a given threshold–perhaps several hours–something that makes sense for your business.