PhilBlog.com http://philblog.com Most recent posts at PhilBlog.com posterous.com
Tue, 08 Mar 2011 09:30:00 -0800 A Declaration of Cyber-War | Culture | Vanity Fair http://philblog.com/a-declaration-of-cyber-war-culture-vanity-fai http://philblog.com/a-declaration-of-cyber-war-culture-vanity-fai

Another fascinating read on Stuxnet.

"When Stuxnet moves into a computer, it attempts to spread to every machine on that computer’s network and to find out whether any are running Siemens software. If the answer is no, Stuxnet becomes a useless, inert feature on the network. If the answer is yes, the worm checks to see whether the machine is connected to a P.L.C. or waits until it is. Then it fingerprints the P.L.C. and the physical components connected to the controller, looking for a particular kind of machinery. If Stuxnet finds the piece of machinery it is looking for, it checks to see if that component is operating under certain conditions. If it is, Stuxnet injects its own rogue code into the controller, to change the way the machinery works. And even as it sabotages its target system, it fools the machine’s digital safety system into reading as if everything were normal."

http://files.posterous.com/user_profile_pics/890186/4374347412_661ce52885.jpg http://posterous.com/users/1kQJfu7iQGsh Phil Ripperger Phil Phil Ripperger
Tue, 18 Jan 2011 11:29:00 -0800 Did a U.S. Government Lab Help Israel Develop Stuxnet? | Threat Level http://philblog.com/did-a-us-government-lab-help-israel-develop-s http://philblog.com/did-a-us-government-lab-help-israel-develop-s

The Stuxnet story just keeps getting better.

Wed, 29 Dec 2010 12:25:00 -0800 Kinect Hacked to Play Full-Body World of Warcraft http://philblog.com/kinect-hacked-to-play-full-body-world-of-warc http://philblog.com/kinect-hacked-to-play-full-body-world-of-warc

Wed, 22 Dec 2010 07:56:00 -0800 The Arizona motorcycle saddle - Boing Boing http://philblog.com/the-arizona-motorcycle-saddle-boing-boing http://philblog.com/the-arizona-motorcycle-saddle-boing-boing
Sun, 19 Dec 2010 07:56:00 -0800 And Now Presenting: Amazing Satellite Images Of The Ghost Cities Of China http://philblog.com/and-now-presenting-amazing-satellite-images-o http://philblog.com/and-now-presenting-amazing-satellite-images-o
Thu, 16 Dec 2010 20:40:00 -0800 Word Lens Translates Words Inside of Images. Yes Really. http://philblog.com/word-lens-translates-words-inside-of-images-y http://philblog.com/word-lens-translates-words-inside-of-images-y

Wow

Tue, 14 Dec 2010 08:30:00 -0800 Facebook | Visualizing Friendships http://philblog.com/facebook-visualizing-friendships http://philblog.com/facebook-visualizing-friendships
Mon, 13 Dec 2010 08:43:00 -0800 Raw Video: Snow Causes Metrodome Roof Collapse http://philblog.com/raw-video-snow-causes-metrodome-roof-collapse http://philblog.com/raw-video-snow-causes-metrodome-roof-collapse

Having been to the Metrodome, I can't imagine the pressure / wind inside during the collapse.

Thu, 09 Dec 2010 15:06:00 -0800 Kinect finally fulfills its Minority Report destiny (video) -- Engadget http://philblog.com/kinect-finally-fulfills-its-minority-report-d http://philblog.com/kinect-finally-fulfills-its-minority-report-d

Wed, 08 Dec 2010 22:54:00 -0800 Stuxnet: It's the real thing, baby | The Best Defense http://philblog.com/stuxnet-its-the-real-thing-baby-the-best-defe http://philblog.com/stuxnet-its-the-real-thing-baby-the-best-defe

Wow! - Used "legitimate certificates stolen from two certificate authorities" to digitally sign Stuxnet code to be installed on target machines.

Tue, 07 Dec 2010 21:15:00 -0800 Egypt: Sinai shark attacks could be Israeli plot http://philblog.com/egypt-sinai-shark-attacks-could-be-israeli-pl http://philblog.com/egypt-sinai-shark-attacks-could-be-israeli-pl

Nice one Egypt.

Mon, 06 Dec 2010 22:23:00 -0800 Inception in Real-Time http://philblog.com/inception-in-real-time http://philblog.com/inception-in-real-time

Mon, 06 Dec 2010 08:03:00 -0800 Kinect turned into a quadrocopter radar (video) -- Engadget http://philblog.com/kinect-turned-into-a-quadrocopter-radar-video http://philblog.com/kinect-turned-into-a-quadrocopter-radar-video

Sun, 05 Dec 2010 13:04:00 -0800 Progressive Data Solutions - Hadoop on Rails http://philblog.com/progressive-data-solutions-hadoop-on-rails http://philblog.com/progressive-data-solutions-hadoop-on-rails

Hadoop on Rails

A few months ago, I wrote an article about using Ruby with Hadoop, and more specifically, the Amazon Elastic MapReduce (EMR) service. I hope some of you found that article helpful.

I figured it was time to post a follow-up - using Rails with Hadoop. Much of my work over the past few months has been building a system to efficiently store, process, and display large amounts of log data. Naturally I wanted to use Rails, but I also knew that I needed to use EMR. I was tasked with building the system myself, so rather than spend more time building and maintaining a Hadoop cluster, I opted to use EMR up-front and focus on the entire process.

So what have I done with Hadoop and Rails? Essentially, I’ve built a system that processes large amounts of log data end-to-end with Ruby, Hadoop/Pig, and Rails.

So what’s going on in this image? Following is a description of each step. Note that all Ruby scripts and processing run within a Rails app, often with script/runner to enable access to the app’s data model:

  1. Log data is collected and stored in S3. A Ruby script on an EC2 instance, started by cron, downloads the log data for the previous day, consolidates the many separate files into one, and then compresses the file using bzip2.
  2. A Ruby script sends the compressed file back to S3, storing it in a new bucket.
  3. The compressed file is also sent to the Rackspace CloudFiles service, for off-site backup.
  4. After log file consolidation and backup is complete, a Ruby script starts an Elastic MapReduce job.
  5. Data for the job, created during steps 1 and 2, is transferred from S3 to the temporary Hadoop cluster created by Elastic MapReduce. The data is processed using a Pig script which is also stored in an S3 bucket.
  6. Results of the EMR processing are stored in S3, in a separate bucket.
  7. Later, after the Elastic MapReduce job is complete, the output is downloaded via a Ruby script to the tmp/ directory within the Rails app.
  8. Once the data is downloaded, it is processed within the context of the Rails app, and loaded into a MySQL database residing on Amazon’s Relational Database Service.
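
The consolidation part of step 1 can be sketched in plain Ruby. This is a minimal illustration, not the code from the system described above: the `consolidate_logs` and `previous_day` helpers, the local directory layout, and the date-prefixed file names are all assumptions. The S3 download and upload and the bzip2 compression are left out so the sketch stays self-contained.

```ruby
require 'date'

# Hypothetical helper: concatenate every log file for one day into one file.
# Assumes the downloaded log files carry a YYYY-MM-DD prefix in their names.
def consolidate_logs(log_dir, day, out_path)
  prefix = day.strftime('%Y-%m-%d')
  parts = Dir.glob(File.join(log_dir, "#{prefix}*")).sort
  File.open(out_path, 'w') do |out|
    parts.each do |part|
      File.foreach(part) { |line| out.write(line) }
    end
  end
  parts.size # number of files that were combined
end

# Hypothetical helper: a cron-driven run works on the previous day's logs.
def previous_day(today = Date.today)
  today - 1
end
```

A nightly run would then call something like `consolidate_logs('tmp/logs', previous_day, 'tmp/combined.log')` before compressing and uploading the result.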

Notice that each step makes use of Ruby and / or Rails. Ruby really is the glue that holds this system together, and it’s a very powerful glue. A lot of what I am doing is date-specific, and Ruby’s date library and methods make parsing and handling dates much easier than it would be with shell scripts.

The other language used in this system, Pig, is used to filter, count, and group the large datasets. Once Pig has done its work, running on EMR, the output is just a series of text files that are parsed by Ruby, then stored in MySQL using ActiveRecord relationships. Hadoop / Pig does the heavy lifting, while Ruby / Rails controls everything.
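
As a rough sketch of the parsing step, the snippet below reads Pig output into Ruby pairs. It assumes the Pig script stored two fields per line (a key and a count) with Pig's default tab delimiter; `parse_pig_output` is a hypothetical helper, and the ActiveRecord loading is left out because the data model is not shown here.

```ruby
require 'stringio'

# Hypothetical helper: turn "key<TAB>count" lines into [key, count] pairs.
# Blank or malformed lines without a tab are skipped.
def parse_pig_output(io)
  io.each_line.map do |line|
    key, count = line.chomp.split("\t", 2)
    next if key.nil? || count.nil?
    [key, Integer(count)]
  end.compact
end

# Example input standing in for one downloaded output file.
sample = StringIO.new("10.0.0.1\t42\n10.0.0.2\t7\n")
rows = parse_pig_output(sample)
```

Each pair in `rows` could then be saved through whatever ActiveRecord model fits the report being built.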

Each part of the system is designed to grow as needed. If log combination and compression takes too long, it can be moved to a larger, more powerful EC2 instance. Once that process gets too big for a single EC2 instance, it could be moved into its own EMR process, using as many machines as necessary.

Likewise, if the Hadoop/Pig processing takes too long, more machines can be added by adjusting one line in the controlling script. Even the MySQL storage can be increased or moved to a more powerful server if needed, thanks to RDS and its simple API.

The biggest challenge in getting this system up and running was learning Pig. Once you understand that Pig is really for filtering, grouping, and counting data, you realize its power. Pig is not Turing Complete, so it can be challenging to solve problems with it. For instance, there are no loops, which can make certain types of problems difficult to solve. I worked around some problems I encountered by moving some of the processing into the Rails app.

There are issues with this system that will need to be addressed soon. The log file combination and compression will need to be improved. I’ll probably switch from bzip2 compression to splittable LZO, as detailed in this article. Twitter is doing some pretty cool things with Pig / Hadoop and they make a strong case for using LZO. Another issue I’ll be looking at soon is how to streamline the EMR job process. I’m adding more jobs and at some point I’ll have to abstract what I’m doing into some sort of framework. There’s just too much code duplication there.

Let me know if you have questions.

Testing Posterous with an older article I wrote.

Sun, 05 Dec 2010 13:03:00 -0800 Progressive Data Solutions - Ruby on Hadoop Quickstart http://philblog.com/progressive-data-solutions-ruby-on-hadoop-qui http://philblog.com/progressive-data-solutions-ruby-on-hadoop-qui

Ruby on Hadoop Quickstart

After my recent experiences with CouchDB (which is a great product) I was forced to look for something that could handle large amounts of data more efficiently. After doing some research, I settled on Hadoop.

If you are dealing with truly large amounts of data, in the multiple terabyte range or larger, there really are only a few options available to efficiently store and process that data. If you are a company with money to burn, you can talk to Oracle. If that doesn’t appeal to you, you can do what many companies are doing - using Hadoop to store and process their data.

I built a small prototype using Hadoop over the course of a few weeks and really liked what I saw. Hadoop is an Apache Software Foundation project whose design is based on Google’s published MapReduce and Google File System papers. The project has seen steady growth over the past few years, with contributions from engineers at Yahoo, Facebook, and others. Many companies now run Hadoop on large clusters of machines and use it to store and process many terabytes or even petabytes of data.

I quickly realized that Hadoop would be able to handle the requirements of the project I was working on, but I also realized it was complex and had a steep learning curve. Hadoop is written in Java, so you can download the source, compile it, and run it yourself, but doing so makes setting up even a small cluster challenging. Fortunately, there are companies like Cloudera that provide pre-configured images for EC2, VMware, etc. Using Cloudera, I was able to get a small Hadoop cluster running on EC2 pretty quickly.

Even with Cloudera’s help, I realized I would be spending too much of my time configuring and maintaining servers. Hadoop has a daunting list of configuration options, and at this point I would rather spend time learning MapReduce concepts and other data processing tools like Pig and Hive. I also wanted to take advantage of Hadoop Streaming, which allows you to write MapReduce programs in your language of choice (Ruby) and process data. I love Ruby, and have no desire to use Java.

Luckily, there is a solution out there that met my goals. Using Amazon’s Elastic MapReduce service, it’s possible to spin up a Hadoop cluster with minimal effort. In fact, all that is needed is an account and a browser. Once the cluster is running, you can access the master node via ssh and get to work. By doing this, I was able to use Ruby scripts for Streaming and also the Pig interface. You need to have your data stored on S3, but since I was already using S3, processing the data was easy.

Assuming you have an Amazon Web Services account set up for Elastic MapReduce, here are the steps to get a Hadoop cluster up and running in ‘interactive mode’:

- Log in to the AWS Management Console

- Click on the ‘Amazon Elastic MapReduce’ tab and choose ‘Create New Job Flow’

- Give the Job Flow a name, and be sure to select the ‘Pig Program’ option. Click Continue.

- Rather than executing a Pig script, we want to start an interactive session. Click Continue.

- Choose how many instances you want in the Hadoop cluster and the type. ‘m1.small’ is fine for testing purposes. You’ll want larger instances for real work.  Be sure to select a key pair to use. You will use this key pair to ssh into the master node. Click Continue.

- Review the settings you chose, then click ‘Create Job Flow’

At this point, Amazon will create the Hadoop cluster. It usually takes a few minutes, so this would be a good time to check your Twitter client. Remember that you must manually shut down this cluster. In non-interactive mode, Elastic MapReduce will start, run the scripts you ask it to, then shut down the cluster. In interactive mode, you are responsible for terminating the cluster when you are done.

When the state of your job flow is ‘waiting’, you will be able to ssh into the master node. Copy the ‘Master Public DNS Name’ and ssh to the cluster with the key pair you selected, using a command of this form (the key path and host name are placeholders):

‘ssh -i /path_to_your_key/key_name hadoop@your_master_public_dns_name’


Note the ‘hadoop’ username. If all goes well, you will see a waiting prompt after successfully connecting to the master node. Do a quick ‘ps ax’ and you will see that several Hadoop processes are running.

Next, we need to get some data into the Hadoop cluster to work with. Assuming you have some type of data stored in S3, you can create a ‘data’ directory in the Hadoop file system with this command:

‘hadoop fs -mkdir /data’

We also need a directory for output:

‘hadoop fs -mkdir /output’

Run ‘hadoop fs -ls /’ and you should see the two directories. Next, assuming you have data residing in an S3 bucket, you can copy the contents of that bucket to your newly created ‘data’ directory with this command:

‘hadoop fs -cp s3://your_bucket_name/* /data’

Remember that data transfer between S3 and EC2 instances is free, so do not be afraid to copy a big chunk of data to the Hadoop cluster. The only constraint is that the EC2 instances have a limited amount of local disk space. There are ways around this, but for this exercise you will probably want to work with a relatively small amount of data: several gigabytes at most.

Once you have the data copied to the Hadoop cluster, you can work with it using a variety of methods. You could type ‘pig’ and be dropped into the grunt shell. Or you could submit MapReduce jobs written in Java via the Hadoop command line interface. But you’re reading this because you want to use Ruby with Hadoop, so we’ll do that.

Hadoop has a method of processing data called Streaming, where data is literally streamed line-by-line to a script via STDIN / STDOUT. This is slower than compiled Java, but it’s also much more convenient. You can start working with Hadoop and learning MapReduce while using a language you are comfortable with. The basic process behind streaming is to write your map and reduce scripts, then submit a Hadoop job via the command line interface, telling it where your scripts are and where the data is.
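
To make the Streaming model concrete, here is a minimal sketch of a mapper and reducer written without any helper library. The function names and the `KEY_FIELD` constant are assumptions for illustration; the field index is chosen to match the log data used later in this article. Both functions take generic input and output objects so they can be tried locally before they are pointed at STDIN and STDOUT.

```ruby
# Index of the space-separated field to count (an assumption about the log format).
KEY_FIELD = 4

# Mapper: read log lines and emit "key<TAB>1" for each line that has the field.
def run_mapper(input, output)
  input.each_line do |line|
    fields = line.strip.split(' ')
    next if fields[KEY_FIELD].nil?
    output.puts "#{fields[KEY_FIELD]}\t1"
  end
end

# Reducer: Hadoop sorts mapper output by key before the reducer sees it,
# so equal keys arrive on consecutive lines and can be summed in one pass.
def run_reducer(input, output)
  current_key = nil
  total = 0
  input.each_line do |line|
    key, value = line.chomp.split("\t", 2)
    next if key.nil?
    if key != current_key
      output.puts "#{current_key}\t#{total}" unless current_key.nil?
      current_key = key
      total = 0
    end
    total += value.to_i
  end
  output.puts "#{current_key}\t#{total}" unless current_key.nil?
end
```

In a real job, each function would live in its own executable script that ends with a call such as `run_mapper(STDIN, STDOUT)`, and the two scripts would be passed to Hadoop Streaming as the mapper and the reducer.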

We could do that, but we are going to take one step back and use a great Ruby interface to Hadoop Streaming called Wukong.

In order to use Wukong, we need to somehow install it on the Hadoop master node. Because we don’t have root or even sudo access, we can’t just install a gem. What I ended up doing is downloading Wukong to my local machine, then using scp to copy it to the master node.

‘scp -i /path_to_your_key/key_name ~/Downloads/wukong.zip hadoop@ec2-67-202-43-146.compute-1.amazonaws.com:wukong.zip’

Put the unzipped files in a directory called ‘wukong’ - the following Ruby scripts will look there.

We are now ready to write our Ruby MapReduce program. For demo purposes, I am going to show you a simple example that I adapted from the Wukong examples. The data I am working with happens to be log data from S3, and I am interested in counting unique IPs over the course of a few months. The following script does just that:

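#!/usr/bin/ruby

# This script must sit in the directory that contains the 'wukong' directory.
$: << File.dirname(__FILE__)+'/wukong/lib'

require 'wukong'

module WordCount
  class Mapper < Wukong::Streamer::LineStreamer
    # Emit the fifth space-separated field of each log line (the client IP
    # in this log data) with a count of 1. Lines that are too short are skipped.
    def process line
      words = line.strip.split(" ").reject(&:blank?)
      yield [words[4], 1] if words[4]
    end
  end

  class Reducer < Wukong::Streamer::ListReducer
    # Sum the counts collected for the current key and emit [key, total].
    def finalize
      total = 0
      counts = values.map(&:last).map(&:to_i)
      counts.each {|x| total += x}
      yield [ key, total ]
    end
  end
end

# Hand the mapper and reducer to Wukong, which runs them as a Streaming job.
Wukong::Script.new(WordCount::Mapper, WordCount::Reducer).run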

Save your script in the same directory as the ‘wukong’ directory. Before we can run the script, we must first tell wukong where Hadoop is, as well as make some of the wukong utilities available:

‘export HADOOP_HOME=/home/hadoop’

‘export PATH=~/scripts/wukong/bin:$PATH’

Finally, it’s time to run a MapReduce job! Be sure your script is executable, then run it using these options:

‘./wukong_demo.rb --run=hadoop /data /output’

Note the /data and /output arguments. The first tells Wukong where the input data is located, the second tells it where you want MapReduce to place the results. While your job runs, Hadoop prints progress output to the terminal.

Note that it will probably take several minutes for your MapReduce job to run. It all depends on how much data you have. Even a small dataset will take three or four minutes.

Once your job is complete, you can view the results in the /output directory on the Hadoop cluster.

‘hadoop fs -ls /output/’

‘hadoop fs -cat /output/output_file_name’
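
Once the results are saved to a local file (for example by redirecting the ‘hadoop fs -cat’ output), a few lines of Ruby can rank them. This is a minimal sketch that assumes each output line is a key and a count separated by a tab; `top_counts` is a hypothetical helper, not part of Wukong or Hadoop.

```ruby
# Hypothetical helper: return the n [key, count] pairs with the highest counts.
# Expects an enumerable of "key<TAB>count" lines; blank lines are ignored.
def top_counts(lines, n)
  pairs = lines.map(&:chomp).reject(&:empty?).map do |line|
    key, count = line.split("\t", 2)
    [key, count.to_i]
  end
  pairs.sort_by { |_key, count| -count }.first(n)
end

# Example input standing in for the contents of one output file.
sample_lines = ["10.0.0.1\t3\n", "10.0.0.2\t10\n", "10.0.0.3\t1\n"]
busiest = top_counts(sample_lines, 2)
```

With a real file, `top_counts(File.readlines('output_file_name'), 10)` would list the ten most frequent IPs.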

At this point, we’ve pretty much covered the basics of using Ruby with Hadoop. There are many issues and options that I have not covered, but I’ll leave those to you to explore and figure out.

The bottom line is that you can use Ruby with Hadoop, and Amazon makes it even easier with their Elastic MapReduce service. When you need a full-time Hadoop cluster, spend the time and money to learn and build one. For now, pay for what you use on Amazon, and focus on learning MapReduce concepts and Hadoop fundamentals.

It’s amazing to me that this type of processing power is available on a pay-for-use basis. Running a 100-node Hadoop cluster for a few hours would be cheap and very efficient. Even three or four years ago, that type of compute power was only available to a select few companies and governments.

Have fun and let me know if you have questions.

Testing Posterous with an older article I wrote.

Sun, 05 Dec 2010 12:59:00 -0800 How an RC airplane buzzed the Statue of Liberty, with no arrests http://philblog.com/how-an-rc-airplane-buzzed-the-statue-of-liber http://philblog.com/how-an-rc-airplane-buzzed-the-statue-of-liber
