Thursday, April 19, 2012

Latency distributions

Say that we want to model the latency distribution of the following system.  Task 0 performs task 1.  Task 1 performs tasks 2, 3, 4, and 5 concurrently, and task 3 performs tasks 6 and 7 concurrently.



Since we are somewhat sensitive to long latencies, we'll set a timeout of 100 (ms).  Based on data we've collected about each task, we'll model each task latency as a log-normal distribution, with σ = 1 and μ = 0. That gives us about 10ms @ 99%.



The question here: what does the distribution of latencies for the whole system look like?  The function d samples from a task's latencies.


function () { return(d() + max(d(), d() + max(d(), d()), d(), d())) }
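For readers without R at hand, here is a minimal Python sketch of the same simulation. It assumes the task latency model stated above (log-normal, μ = 0, σ = 1) and mirrors the R expression term for term. Like that expression, it does not apply the 100 ms timeout. The names system_latency, percentile, and simulate are illustrative only.

```python
import random

def d(rng):
    """Sample one task latency in ms: log-normal with mu = 0, sigma = 1."""
    return rng.lognormvariate(0.0, 1.0)

def system_latency(rng):
    """Mirror the R expression: one serial call plus a concurrent fan-out."""
    inner = d(rng) + max(d(rng), d(rng))   # task 3, then tasks 6 and 7 concurrently
    return d(rng) + max(d(rng), inner, d(rng), d(rng))

def percentile(samples, q):
    """Empirical q-quantile (0 < q < 1) by sorting; nearest rank, no interpolation."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]

def simulate(n=100_000, seed=1):
    """Return (single-task p99, whole-system p99) from n samples each."""
    rng = random.Random(seed)
    task = [d(rng) for _ in range(n)]
    system = [system_latency(rng) for _ in range(n)]
    return percentile(task, 0.99), percentile(system, 0.99)

if __name__ == "__main__":
    task_p99, system_p99 = simulate()
    print(f"task p99   ~ {task_p99:.1f} ms")
    print(f"system p99 ~ {system_p99:.1f} ms")
```

The exact figures depend on the seed and sample count, so treat the output as an estimate rather than a reproduction of the numbers in this post.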


After some simulation, we get



The system has 24ms @ 99% latency.  Our system has 2.4x the latency of one of its local tasks.

How does concurrency affect latency distributions?  With R, the simulation is a little slow.  Here are the results for 1-10 concurrent calls.


The work is quadratic, so we'd like to go much faster.  Let's try out Julia.

It took just a few minutes to implement the simulation in Julia.

Our Julia-based simulation runs at least 10x faster.  That'll allow us to look at a lot more data.  Here's a plot in R of data simulated in Julia.


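The 99% latency grows roughly logarithmically with the number of concurrent calls.  (That is the behavior of a maximum of samples, which extreme value theory describes; the Central Limit Theorem covers sums.)  The fit from R: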
Residuals:
    Min      1Q  Median      3Q     Max 
-3.2829 -1.0077 -0.1158  0.7845  6.1596 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.8672     0.5922   8.219 8.63e-13 ***
log(d0$X)     7.5954     0.1578  48.133  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.457 on 98 degrees of freedom
Multiple R-squared: 0.9594, Adjusted R-squared: 0.959 
F-statistic:  2317 on 1 and 98 DF,  p-value: < 2.2e-16 
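A rough cross-check that needs no simulation: if "n concurrent calls" means the maximum of n independent task latencies (an assumption, since the simulated function is not shown here), the 99% point has a closed form. P(max ≤ x) = F(x)^n, so we solve F(x) = 0.99^(1/n). The sketch below computes that for n = 1..100 and fits a + b·log(n) by ordinary least squares. The names p99_of_max and fit_log are illustrative, and the coefficients need not match the R output above, which came from simulated data.

```python
import math
from statistics import NormalDist

def p99_of_max(n, mu=0.0, sigma=1.0, q=0.99):
    """q-quantile of the maximum of n iid log-normal latencies.

    P(max <= x) = F(x)**n, so solve F(x) = q**(1/n) for x.
    """
    z = NormalDist().inv_cdf(q ** (1.0 / n))
    return math.exp(mu + sigma * z)

def fit_log(ns, ys):
    """Ordinary least squares for y = a + b * log(n); returns (a, b)."""
    xs = [math.log(n) for n in ns]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    return y_mean - b * x_mean, b

if __name__ == "__main__":
    ns = list(range(1, 101))
    ys = [p99_of_max(n) for n in ns]
    a, b = fit_log(ns, ys)
    print(f"p99 ~ {a:.2f} + {b:.2f} * log(n)")
```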


With Julia's performance, our simulations can be more than 10 times richer, which allows us to get verifiable results for even moderately complex systems.

This just in: the Julia simulation for up to 500 concurrent tasks.


Julia is very fast!

Saturday, March 10, 2012

SxSW Proton Flux

Yesterday I flew from SJC to AUS with many other folks heading to SxSW.


During the flight, I thought a bit about proton flux.  We watch space weather from time to time, and this week has been one of those times.


The sun erupted with one of the largest solar flares of this solar cycle on March 6, 2012 at 7PM EST. This flare was categorized as an X5.4, making it the second largest flare -- after an X6.9 on August 9, 2011 -- since the sun’s activity segued into a period of relatively low activity called solar minimum in early 2007. The current increase in the number of X-class flares is part of the sun’s normal 11-year solar cycle, during which activity on the sun ramps up to solar maximum, which is expected to peak in late 2013.
About an hour later, at 8:14 PM ET, March 6, the same region let loose an X1.3 class flare. An X1 is 5 times smaller than an X5 flare.
These X-class flares erupted from an active region named AR 1429 that rotated into view on March 2. Prior to this, the region had already produced numerous M-class and one X-class flare. The region continues to rotate across the front of the sun, so the March 6 flare was more Earthward facing than the previous ones. It triggered a temporary radio blackout on the sunlit side of Earth that interfered with radio navigation and short wave radio.
In association with these flares, the sun also expelled two significant coronal mass ejections (CMEs), which are travelling faster than 600 miles a second and may arrive at Earth in the next few days. In the meantime, the CME associated with the X-class flare from March 4 has dumped solar particles and magnetic fields into Earth’s atmosphere and distorted Earth's magnetic fields, causing a moderate geomagnetic storm, rated a G2 on a scale from G1 to G5. Such storms happen when the magnetic fields around Earth rapidly change strength and shape. A moderate storm usually causes aurora and may interfere with high frequency radio transmission near the poles. This storm is already dwindling, but the Earth may experience another enhancement if the most recent CMEs are directed toward and impact Earth.

NOAA reported

Space Weather Message Code: WARPX1
Serial Number: 347
Issue Time: 2012 Mar 09 2233 UTC
EXTENDED WARNING: Proton 10MeV Integral Flux above 10pfu expected
Extension to Serial Number: 346
Valid From: 2012 Mar 07 0030 UTC
Now Valid Until: 2012 Mar 11 0000 UTC
Warning Condition: persistence
Predicted NOAA Scale: S1 - Minor
Potential Impacts: Radio - Minor impacts on polar HF (high frequency) radio propagation resulting in fades at lower frequencies.
Is the Earth really getting hit by a bunch of high-energy protons (and other particles) while I'm at 37,000 feet?

Yes:








At its peak, the proton flux was 10,000x normal.  These storms can mess with HF communications, be hazardous to astronauts, and damage spacecraft.  Some airlines reroute polar flights to avoid communications disruptions.  Lloyd's cares about space weather.  High-energy protons can cause mutations.

The magnetosphere and atmosphere provide shielding, so what's it like where I was, at 37,000 feet?  A real answer should consider sr, secondary effects, and probably many other aspects.  But it's likely that we got hit with many more solar protons than we would have a week earlier or later.  I should have taken some video with the lens covered up.

Friday, September 9, 2011

Space weather update



Exciting news in space weather today, so we pulled some recent data to get set up for follow-on analysis.  Here's a quick look at magnetic declination at Boulder, CO for the last 30 days.  47,521 points.  Blue to red is August 8 to September 8.  Note that on August 9, which happens to be the blue outlier in the afternoon, the Sun released an X-class solar flare.

The data was collected at magnetic observatories. We thank the national institutes that support them and INTERMAGNET for promoting high standards of magnetic observatory practice (www.intermagnet.org).





In this plot, the Y-axis facets are day of the year.

Tuesday, August 23, 2011

GDP Uncertainty



This post is really about data representation, not economics.

Several commentators have noted the sharp drops in the Bureau of Economic Analysis estimates of the rate of change in U.S. GDP in and around Q4 2008.  For example, the Economist observed:
The BEA’s first estimate of output in the fourth quarter of 2008, published in January of 2009, showed a contraction of 3.8%, later revised to a 6.8% drop. The new numbers change the figure yet again, to a shocking 8.9% fall in GDP. For 2009 as a whole, the American economy shrank by 3.5% rather than the previously reported 2.6%.

Such tardy and substantial changes to the basic picture of the downturn have left many perplexed. The fault lies in the grindingly slow process of government data collection. The BEA pieces together its GDP estimates from a range of monthly economic surveys. Those data, themselves subject to annual revisions, are fed into calculations of national output. Delays plague each step of the process. The 2009 Annual Survey of Manufactures, for instance, was published at last in the fourth quarter of 2010. Its impact on GDP was not revealed until this July, however, because the BEA reports annual revisions just once a year.
In its advance (earliest) release, the BEA did caution its audience regarding what "advance" means when talking about a release.
The Bureau emphasized that the fourth-quarter “advance” estimates are based on source data that are incomplete or subject to further revision by the source agency (see the box on page 4).  The fourth-quarter “preliminary” estimates, based on more comprehensive data, will be released on February 27, 2009.
By the way, that "box on page 4" is interesting.  It reports that
Information on the assumptions used for unavailable source data is provided in a technical note
that is posted with the news release on BEA's Web site. Within a few days after the release, a detailed "Key Source Data and Assumptions" file is posted on the Web site. In the middle of each month, an analysis of the current quarterly estimates of GDP and related series is made available on the Web site; click on Survey of Current Business, "GDP and the Economy."


That's a lot to digest just to understand the caveats (assuming you understand the methodology, which happened to change in 2009).  Note that the assumption details are posted "within a few days" of the release itself.

Models outside the BEA appear to have headed toward what was apparently a more accurate estimate.  At least some are bound to.  One analyst consensus estimated the GDP rate at -5.4 percent.  When markets closed on January 30, 2009, the Financial Times had this to say:

The S&P 500 closed down 2.3 per cent at 825.88, the Dow Jones Industrial Average 1.8 per cent lower at 8,000.86 and the Nasdaq Composite index off 2.1 per cent at 1,476.42.

The market had opened higher after US Department of Commerce figures showed that fourth quarter gross domestic product contracted at a 3.8 per cent annual rate, which, although bleak, was not as bad as feared.
But leading indices slipped into negative territory after the open as analysts pointed out that the headline figure – helped by the number of unwanted unsold goods – belied more worrying underlying trends.
The BEA's subsequent 2008:Q4 "preliminary" (second) release on February 27, 2009 reported a -6.2 percent annualized (and seasonally adjusted) rate of change from the prior period.  That was down from -3.8 percent in the advance release 28 days prior.  The preliminary decline was more than 1.6 times the advance (earlier) one.  The BEA commented:

The preliminary estimate of the fourth-quarter change in real GDP is 2.4 percentage points, or $74.4 billion, lower than the advance estimate issued last month.  The downward revision to the percent change in real GDP was widespread; the largest contributors were downward revisions to private inventory investment, to exports, and to personal consumption expenditures for nondurable goods.
Note the relationship to the Financial Times comments a month earlier.  The S&P 500 was off 2.4% that day, and the yield curve steepened, with the 2-year note down 6 basis points and the 10-year note up 4 basis points.
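To make the size of these revisions concrete, here is a small Python sketch that keeps every vintage of the 2008:Q4 estimate together instead of overwriting one number with the next. The four rates are the ones quoted in this post; the labels on the last two and the names Vintage, revision, and spread are illustrative choices of ours, not BEA terminology.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vintage:
    label: str    # which release the figure comes from
    rate: float   # annualized percent change in real GDP

# 2008:Q4 figures quoted in this post; the last two labels are descriptive only.
VINTAGES_2008Q4 = [
    Vintage("advance", -3.8),
    Vintage("preliminary", -6.2),
    Vintage("later revision", -6.8),
    Vintage("latest revision", -8.9),
]

def revision(earlier, later):
    """Change between two vintages, in percentage points."""
    return round(later.rate - earlier.rate, 1)

def spread(vintages):
    """Range covered by all vintages so far: (lowest, highest)."""
    rates = [v.rate for v in vintages]
    return min(rates), max(rates)

if __name__ == "__main__":
    adv, pre = VINTAGES_2008Q4[0], VINTAGES_2008Q4[1]
    print(revision(adv, pre))              # advance -> preliminary, in points
    print(round(pre.rate / adv.rate, 2))   # preliminary relative to advance
    print(spread(VINTAGES_2008Q4))
```

Carrying the whole list around costs little, and it makes the spread between the first and latest figures visible wherever the estimate is used.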

The BEA's GDP estimation obviously isn't easy or immediate.  In certain circumstances, estimating quarterly GDP is especially hard, and subsequent BEA estimates of 2008:Q4 GDP did not really stabilize (presumably partly due to changes in methodology).  Here's the timeline for BEA updates to the 2008:Q4 GDP rate of change:

BEA 2008:Q4 GDP change releases


That's a long time to wait for market- and policy-moving data.  (Perhaps you can argue that much of the change was due to methodological changes, but where does that get you?)  Was 2008:Q4 a fluke?  Here's the story for the BEA GDP estimates for the four quarters of 2008:

BEA 2008 quarterly GDP change releases 


2008 was not a good year for BEA GDP estimates.  Models often depart from reality at inconvenient times.  The BEA discusses some of these challenges here.  Changes in methodology complicate things.

For forecasts, error bars and other indicators of confidence, variance, distribution, etc., are of course common.  Here's a particularly relevant example from the Federal Reserve, which has its own staff for tracking this stuff:

Source: Federal Open Market Committee
This graph comes directly from the minutes of the Federal Open Market Committee on January 27-28, 2009, and the authors depict the "central tendency" for each forecast period.  Even though 2008 was history, it's obvious now that those recent historical numbers were not themselves history but instead forecasts of the past.  Presenting them as single numbers can encourage a confidence that is not always justified.  2008 GDP estimates deserved "central tendency" bars too.

Cheap shot: What were the error bars on S&P's Lehman counter-party credit risk rating that was reaffirmed in July 2008?

We try to be circumspect when we encounter point estimates.  In our projects, we often go to a fair amount of trouble to carry around information about distributions, confidence, or caveats.  Rarely convenient but often powerful.  We'll have more to report on this topic.


Tuesday, June 14, 2011

Part of HandlerSocket's Missing Manual

Yoshinori Matsunobu's excellent HandlerSocket has some undocumented features, including filters and a kind of IN. Here's a preliminary sketch that follows the HandlerSocket documentation.

Extended 'open':
  P <indexid> <dbname> <tablename> <indexname> <columns> [<fcolumns>]
<fcolumns> has the same syntax as <columns>. These filter columns are used by <filter> specifications, which are described below.

Extended 'find':
  <indexid> <op> <vlen> <v1> ... <vn> <limit> <offset> [<in>] <filter>*

  <in> := @ <icol> <ilen> <iv1> ... <ivn>

  <filter> := W|F <fop> <fcol> <fval>
<in> specifies that <v<icol>> should be sequentially replaced with each <ivi>. For example
  1 = 2 . foo 10 0 @ 0 3 6 7 8
(where . denotes a 0-length value) will result in the following three queries:
  1 = 2 6 foo
  1 = 2 7 foo
  1 = 2 8 foo
It appears that each query returns at most one record, so if you provide N <iv>s, you'll get no more than N records. <filter> specifies when to skip or stop results, based on records that fail the comparison.
  W = stop at the first record that fails
  F = skip each record that fails

  <fop>: one of =, <, >, =<, =>, =!.
  <fcol>: index for <fcolumns>
  <fval>: an encoded value
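The <in> expansion described above can be modeled in a few lines of Python. This is a sketch of the behavior as we understand it, not HandlerSocket's implementation; expand_in is a hypothetical helper, and a 0-length value is represented as an empty string.

```python
def expand_in(index_id, op, values, icol, in_values):
    """Expand an <in> clause into one key lookup per <iv>.

    values    -- the <v1> ... <vn> key values ("" for a 0-length value)
    icol      -- zero-based position of the key value to replace
    in_values -- the <iv1> ... <ivn> replacement values
    """
    if not 0 <= icol < len(values):
        raise ValueError("icol must index one of the key values")
    queries = []
    for iv in in_values:
        key = list(values)
        key[icol] = iv   # substitute this <iv> for the value at position icol
        queries.append([index_id, op, str(len(key))] + key)
    return queries

if __name__ == "__main__":
    # Models the request: 1 = 2 . foo 10 0 @ 0 3 6 7 8
    for query in expand_in("1", "=", ["", "foo"], 0, ["6", "7", "8"]):
        print(" ".join(query))
```

Run as a script, it prints the three expanded lookups listed above.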
Examples:
bash$ echo 'CREATE TABLE foo (x INT, y VARCHAR(8), PRIMARY KEY (x,y));' \
      | mysql test
bash$ nc localhost 9999
P    1    test foo  PRIMARY   x,y
0    1
1    +    2    1    one
0    1
1    +    2    2    two
0    1
1    +    2    3    three
0    1
1    +    2    4    four
0    1
1    >    1    0    10   0
0    2    1    one  2    two  3    three     4    four
P    1    test foo  PRIMARY   x,y  x
0    1
1    >    1    0    10   0    F    <    0    3
0    2    1    one  2    two
1    >    1    0    10   0    F    <    0    3
0    2    1    one  2    two
1    >    1    0    10   0    F    >    0    3
0    2    4    four
1    >    1    0    10   0    W    <    0    3
0    2    1    one  2    two
1    >    1    0    10   0    W    >    0    3
0    2
1    =         1    10   0    @    0    2    2    4
0    2    2    two  4    four
1    =         1    10   0    @    0    3    2    4    6
0    2    2    two  4    four
1    <=        1    10   0    @    0    3    2    4    6
0    2    2    two  4    four 4    four
The above is just a quick sketch. Let us know where we are off.
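The F and W results in the transcript are consistent with a simple model: walk the rows in index order, keep rows that pass the comparison, and on a failing row either skip it (F) or stop (W). The Python sketch below encodes that model so it can be checked against the transcript. It is our reading of the observed behavior, not HandlerSocket code; apply_filter is a hypothetical name, rows are plain tuples, and fcol indexes the tuple directly, which matches the transcript only because the single filter column there is x.

```python
import operator

# Only the comparison operators exercised in the transcript are modeled here.
FOPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

def apply_filter(rows, ftype, fop, fcol, fval):
    """Keep rows that pass; on a failing row, F skips it and W stops the scan."""
    if ftype not in ("F", "W"):
        raise ValueError("ftype must be 'F' or 'W'")
    compare = FOPS[fop]
    kept = []
    for row in rows:
        if compare(row[fcol], fval):
            kept.append(row)
        elif ftype == "W":
            break   # W: stop at the first row that fails the comparison
        # F: drop the failing row and continue scanning
    return kept

# The four rows inserted in the transcript, in primary-key order.
ROWS = [(1, "one"), (2, "two"), (3, "three"), (4, "four")]

if __name__ == "__main__":
    for ftype, fop in [("F", "<"), ("F", ">"), ("W", "<"), ("W", ">")]:
        print(ftype, fop, apply_filter(ROWS, ftype, fop, 0, 3))
```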

Tuesday, March 29, 2011

NTSB Aviation Accident Feed Updated

Just a quick note. We updated our NTSB Aviation Accident RSS feed. Various data sources had changed, so we had to change as well. Briefly, this feed provides links, when possible, to FlightAware, SkyVector, historical weather data, and more.

Our feed is here, and the official source is here.

Wednesday, September 8, 2010

Geomagnetic storm data




For a recent project, we looked at geomagnetic storm data. Space weather is making the news these days.


Lots of interesting data is available. Among other things, we found ourselves looking at the relationships between F and H, Dst, and Kp.



For example, the figure below represents F (blue), H (red), and Dst (green) during a geomagnetic storm on October 28-30, 2003. Each X tick is one minute. Dst values are interpolated.
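Dst is an hourly index, so plotting it against one-minute ticks requires filling in values between the hourly points. A minimal Python sketch, assuming straight-line interpolation (one simple choice, and not necessarily the method used for the figure); the function name is hypothetical.

```python
def interpolate_hourly_to_minutes(hourly):
    """Linearly interpolate hourly values onto a one-minute grid.

    Returns 60 * (len(hourly) - 1) + 1 values; the hourly points are preserved.
    """
    if len(hourly) < 2:
        return list(hourly)
    minutes = []
    for left, right in zip(hourly, hourly[1:]):
        step = (right - left) / 60.0
        # Minutes 0..59 of this hour; the next hour's point starts the next segment.
        minutes.extend(left + step * m for m in range(60))
    minutes.append(hourly[-1])
    return minutes
```

For example, three hourly values yield 121 one-minute values, with the original readings at positions 0, 60, and 120.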



The following figure presents the absolute value of percentage changes from the means.
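For clarity, "absolute value of percentage change from the mean" is taken here to be |x - mean| / |mean| × 100, computed separately for each series. A short Python sketch under that reading; the function name is hypothetical. Note that the measure is unstable for a series whose mean is near zero.

```python
def abs_pct_change_from_mean(series):
    """|x - mean| / |mean| * 100 for each x; the mean must be nonzero."""
    mean = sum(series) / len(series)
    if mean == 0:
        raise ValueError("percentage change from a zero mean is undefined")
    return [abs(x - mean) / abs(mean) * 100.0 for x in series]
```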



The figure below represents the percentage change in F in the same timeframe:



Our review of the literature didn't turn up anything definitive on the relationship between F and H (or Dst and Kp). Anybody have any pointers? We've begun some analysis of data like


Source: http://geomag.usgs.gov/realtime/


but we'd obviously rather not build models from scratch if we don't have to. We'll make another pass at the literature and update this post accordingly. So stay tuned for exciting space weather analysis for the upcoming season.