How to develop Spark applications against the Spark source in IntelliJ

A previous article described how to use Maven to build a Spark jar that runs directly on Hadoop 2.2.0. Building on that, this article explains how to set up a Spark integrated development environment with Eclipse. Note that Eclipse is not recommended for developing Spark programs or for reading the Spark source; IntelliJ IDEA is the recommended choice.
(1) Preparation
Before getting started, prepare the following software and hardware.
Software:
Eclipse
Scala 2.9.3 (a Windows installer is available)
The Eclipse Scala IDE plugin
Hardware: one machine running Linux or Windows.
(2) Building the Spark integrated development environment
I did this on Windows; the steps are as follows:
Step 1: Install Scala 2.9.3 by running the installer.
Step 2: Copy all files under the features and plugins directories of the Eclipse Scala IDE plugin into the corresponding directories of the unpacked Eclipse.
Step 3: Restart Eclipse and click the perspective (box-shaped) button at the top-right corner of Eclipse. Expand it, click "Other…", and check whether a "Scala" entry appears. If it does, open it; otherwise continue with step 4.
Step 4: In Eclipse, choose "Help" -> "Install New Software…", enter the Scala IDE update-site URL in the dialog that opens, and press Enter. A list of installable items appears; selecting and installing the first two is enough. (Because step 3 already copied the jars into Eclipse, installation is fast; it merely registers the plugin.) After installation, repeat step 3.
(3) Developing Spark programs in Scala
In Eclipse, choose "File" -> "New" -> "Other…" -> "Scala Wizard" -> "Scala Project" to create a Scala project, and name it "SparkScala".
Right-click the "SparkScala" project and choose "Properties". In the dialog, choose "Java Build Path" -> "Libraries" -> "Add External JARs…" and import spark-assembly-0.8.1-incubating-hadoop2.2.0.jar from the assembly/target/scala-2.9.3/ directory produced in the earlier article. This jar can also be generated by building Spark yourself; it is placed under the assembly/target/scala-2.9.3/ directory of the Spark tree.
In the same way as the project was created, add a Scala class to the project and name it WordCount.
WordCount is the classic word-count program: it counts the total number of occurrences of every word in the input directory. The Scala code is as follows:
import org.apache.spark._
import SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length != 3) {
      println("usage is org.test.WordCount <master> <input> <output>")
      return
    }
    val sc = new SparkContext(args(0), "WordCount",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))
    val textFile = sc.textFile(args(1))
    val result = textFile.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1)).reduceByKey(_ + _)
    result.saveAsTextFile(args(2))
  }
}
In the Scala project, right-click "WordCount.scala", choose "Export", and in the dialog choose "Java" -> "JAR File" to package the program into a jar; it can be named, say, "spark-wordcount-in-scala.jar".
The WordCount program takes three arguments: the master location, the HDFS input directory, and the HDFS output directory. To launch it, a run_spark_wordcount.sh script can be written:
# Point this at the directory that holds the YARN configuration files
export YARN_CONF_DIR=/opt/hadoop/yarn-client/etc/hadoop/

SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.2.0.jar \
./spark-class org.apache.spark.deploy.yarn.Client \
  --jar spark-wordcount-in-scala.jar \
  --class WordCount \
  --args yarn-standalone \
  --args hdfs://hadoop-test/tmp/input \
  --args hdfs://hadoop-test/tmp/output \
  --num-workers 1 \
  --master-memory 2g \
  --worker-memory 2g \
  --worker-cores 2
A few points to note: the WordCount program's input arguments are passed via "--args", one flag per argument. The second argument is the HDFS input directory, which must be created beforehand and have a few text files uploaded into it so there is something to count; the third argument is the HDFS output directory, which is created at run time and must not exist before the run.
Running the run_spark_wordcount.sh script directly produces the result.
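To inspect the result, one option is to read the output directory back from a spark-shell that can reach the same HDFS. This is a minimal sketch of my own, assuming the output path used in the script above and the shell's predefined SparkContext sc:

// Read the WordCount output back and print the first lines (run in spark-shell).
// The path is the output directory passed via --args in run_spark_wordcount.sh.
val counts = sc.textFile("hdfs://hadoop-test/tmp/output")
counts.take(20).foreach(println)   // each line looks like (word,count)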
While running this, I found a bug: org.apache.spark.deploy.yarn.Client has a "--name" option for specifying the application name.
In practice, however, using this option makes the application hang. Looking at the source shows it really is a bug, which has since been reported upstream:
// Location: new-yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala
case ("--queue") :: value :: tail =>
  amQueue = value
  args = tail

case ("--name") :: value :: tail =>
  appName = value
  args = tail // this line was missing, which caused the program to hang

case ("--addJars") :: value :: tail =>
  addJars = value
  args = tail
So for now, do not use the "--name" option, or else fix this bug and rebuild Spark.
(4) Developing Spark programs in Java
This works the same way as ordinary Java development: just add the Spark package spark-assembly-0.8.1-incubating-hadoop2.2.0.jar as a third-party dependency. (The Java API lives under org.apache.spark.api.java, e.g. JavaSparkContext and JavaRDD.)
(5) Summary
In this first trial of Spark on YARN, there were still quite a few problems; it is inconvenient to use and the barrier to entry remains high, far less mature than Spark on Mesos.
[Spark 22] Debugging and Running Spark Applications in IntelliJ IDEA (from bit1129's blog on ITeye)
Scala version: scala-2.10.4
Earlier attempts to set up the environment kept failing, most likely because Scala 2.11.4 was being used. The Spark website states explicitly that the prebuilt Spark 1.2.0 packages do not support Scala 2.11.4:
Note: Scala 2.11 users should download the Spark source package and build with Scala 2.11 support.
Spark version:
spark-1.2.0-bin-hadoop2.4.tgz
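Before wiring up the IDE it can save time to confirm which Scala and Spark versions the application actually runs against. The sketch below is my own addition, not part of the original article; it simply prints both:

// Print the Scala library version and the Spark version to catch a 2.10/2.11 mismatch early.
object VersionCheck {
  def main(args: Array[String]): Unit = {
    println("Scala: " + scala.util.Properties.versionString)
    val sc = new org.apache.spark.SparkContext(
      new org.apache.spark.SparkConf().setAppName("VersionCheck").setMaster("local"))
    println("Spark: " + sc.version)
    sc.stop()
  }
}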
Configure the environment variables:
export SCALA_HOME=/home/hadoop/spark1.2.0/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH
export SPARK_HOME=/home/hadoop/spark1.2.0/spark-1.2.0-bin-hadoop2.4
export PATH=$SPARK_HOME/bin:$PATH
Setting up IntelliJ IDEA for developing Spark programs
1. Download and install the Scala plugin
2. Create a Scala non-SBT project
3. Import the Spark jar: add the assembly jar shipped in the spark-1.2.0-bin-hadoop2.4 distribution (under its lib/ directory) as a project library
4. Write the wordcount example code
package spark.examples

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SparkWordCount {
  def main(args: Array[String]) {
    // Note setMaster("local"): Spark runs in local mode here
    // (be aware of the difference between local and standalone mode)
    val conf = new SparkConf().setAppName("SparkWordCount").setMaster("local")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("file:///home/hadoop/spark1.2.0/word.txt")
    rdd.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .map(x => (x._2, x._1))
      .sortByKey(false)
      .map(x => (x._2, x._1))
      .saveAsTextFile("file:///home/hadoop/spark1.2.0/WordCountResult")
  }
}
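While iterating inside the IDE it is often handier to print a few results to the console instead of writing output files. The following variation is my own sketch; the sortBy call and the take(10) limit are not in the original example:

// Print the ten most frequent words instead of saving to a file.
rdd.flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)   // RDD.sortBy is available in Spark 1.2
  .take(10)
  .foreach(println)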
Console log from running the example above:
15/01/14 22:06:34 WARN Utils: Your hostname, hadoop-Inspiron-3521 resolves to a loopback address: 127.0.1.1; using 192.168.0.111 instead (on interface eth1)
15/01/14 22:06:34 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/01/14 22:06:35 INFO SecurityManager: Changing view acls to: hadoop
15/01/14 22:06:35 INFO SecurityManager: Changing modify acls to: hadoop
15/01/14 22:06:35 INFO SecurityManager: SecurityManager: aut users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/01/14 22:06:36 INFO Slf4jLogger: Slf4jLogger started
15/01/14 22:06:36 INFO Remoting: Starting remoting
15/01/14 22:06:36 INFO Remoting: R listening on addresses :[akka.tcp://sparkDriver@hadoop-Inspiron-3521.local:53624]
15/01/14 22:06:36 INFO Utils: Successfully started service 'sparkDriver' on port 53624.
15/01/14 22:06:36 INFO SparkEnv: Registering MapOutputTracker
15/01/14 22:06:36 INFO SparkEnv: Registering BlockManagerMaster
15/01/14 22:06:36 INFO DiskBlockManager: Created local directory at /tmp/spark-local-36-4826
15/01/14 22:06:36 INFO MemoryStore: MemoryStore started with capacity 461.7 MB
15/01/14 22:06:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/01/14 22:06:37 INFO HttpFileServer: HTTP File server directory is /tmp/spark-5-498c-9b72-9c6a13684f44
15/01/14 22:06:37 INFO HttpServer: Starting HTTP Server
15/01/14 22:06:38 INFO Utils: Successfully started service 'HTTP file server' on port 53231.
15/01/14 22:06:43 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/01/14 22:06:43 INFO SparkUI: Started SparkUI at http://hadoop-Inspiron-3521.local:4040
15/01/14 22:06:43 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@hadoop-Inspiron-3521.local:53624/user/HeartbeatReceiver
15/01/14 22:06:44 INFO NettyBlockTransferService: Server created on 46971
15/01/14 22:06:44 INFO BlockManagerMaster: Trying to register BlockManager
15/01/14 22:06:44 INFO BlockManagerMasterActor: Registering block manager localhost:46971 with 461.7 MB RAM, BlockManagerId(<driver>, localhost, 46971)
15/01/14 22:06:44 INFO BlockManagerMaster: Registered BlockManager
15/01/14 22:06:44 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=
15/01/14 22:06:44 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 461.5 MB)
15/01/14 22:06:45 INFO MemoryStore: ensureFreeSpace(22692) called with curMem=163705, maxMem=
15/01/14 22:06:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 461.5 MB)
15/01/14 22:06:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:46971 (size: 22.2 KB, free: 461.7 MB)
15/01/14 22:06:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/01/14 22:06:45 INFO SparkContext: Created broadcast 0 from textFile at SparkWordCount.scala:40
15/01/14 22:06:45 INFO FileInputFormat: Total input paths to process : 1
15/01/14 22:06:45 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/01/14 22:06:45 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/01/14 22:06:45 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/01/14 22:06:45 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/01/14 22:06:45 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/01/14 22:06:46 INFO SparkContext: Starting job: saveAsTextFile at SparkWordCount.scala:43
15/01/14 22:06:46 INFO DAGScheduler: Registering RDD 3 (map at SparkWordCount.scala:43)
15/01/14 22:06:46 INFO DAGScheduler: Registering RDD 5 (map at SparkWordCount.scala:43)
15/01/14 22:06:46 INFO DAGScheduler: Got job 0 (saveAsTextFile at SparkWordCount.scala:43) with 1 output partitions (allowLocal=false)
15/01/14 22:06:46 INFO DAGScheduler: Final stage: Stage 2(saveAsTextFile at SparkWordCount.scala:43)
15/01/14 22:06:46 INFO DAGScheduler: Parents of final stage: List(Stage 1)
15/01/14 22:06:46 INFO DAGScheduler: Missing parents: List(Stage 1)
15/01/14 22:06:46 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[3] at map at SparkWordCount.scala:43), which has no missing parents
15/01/14 22:06:46 INFO MemoryStore: ensureFreeSpace(3560) called with curMem=186397, maxMem=
15/01/14 22:06:46 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.5 KB, free 461.5 MB)
15/01/14 22:06:46 INFO MemoryStore: ensureFreeSpace(2528) called with curMem=189957, maxMem=
15/01/14 22:06:46 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.5 KB, free 461.5 MB)
15/01/14 22:06:46 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:46971 (size: 2.5 KB, free: 461.7 MB)
15/01/14 22:06:46 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/01/14 22:06:46 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
15/01/14 22:06:46 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[3] at map at SparkWordCount.scala:43)
15/01/14 22:06:46 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/01/14 22:06:46 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1292 bytes)
15/01/14 22:06:46 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/01/14 22:06:46 INFO HadoopRDD: Input split: file:/home/hadoop/spark1.2.0/word.txt:0+29
15/01/14 22:06:46 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1895 bytes result sent to driver
15/01/14 22:06:46 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 323 ms on localhost (1/1)
15/01/14 22:06:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/01/14 22:06:46 INFO DAGScheduler: Stage 0 (map at SparkWordCount.scala:43) finished in 0.350 s
15/01/14 22:06:46 INFO DAGScheduler: looking for newly runnable stages
15/01/14 22:06:46 INFO DAGScheduler: running: Set()
15/01/14 22:06:46 INFO DAGScheduler: waiting: Set(Stage 1, Stage 2)
15/01/14 22:06:46 INFO DAGScheduler: failed: Set()
15/01/14 22:06:46 INFO DAGScheduler: Missing parents for Stage 1: List()
15/01/14 22:06:46 INFO DAGScheduler: Missing parents for Stage 2: List(Stage 1)
15/01/14 22:06:46 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[5] at map at SparkWordCount.scala:43), which is now runnable
15/01/14 22:06:46 INFO MemoryStore: ensureFreeSpace(2992) called with curMem=192485, maxMem=
15/01/14 22:06:46 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.9 KB, free 461.5 MB)
15/01/14 22:06:46 INFO MemoryStore: ensureFreeSpace(2158) called with curMem=195477, maxMem=
15/01/14 22:06:46 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.1 KB, free 461.5 MB)
15/01/14 22:06:46 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:46971 (size: 2.1 KB, free: 461.7 MB)
15/01/14 22:06:46 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/01/14 22:06:46 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:838
15/01/14 22:06:46 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MappedRDD[5] at map at SparkWordCount.scala:43)
15/01/14 22:06:46 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/01/14 22:06:46 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1045 bytes)
15/01/14 22:06:46 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/01/14 22:06:46 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/01/14 22:06:46 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 12 ms
15/01/14 22:06:46 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1000 bytes result sent to driver
15/01/14 22:06:46 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 110 ms on localhost (1/1)
15/01/14 22:06:46 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/01/14 22:06:46 INFO DAGScheduler: Stage 1 (map at SparkWordCount.scala:43) finished in 0.106 s
15/01/14 22:06:46 INFO DAGScheduler: looking for newly runnable stages
15/01/14 22:06:46 INFO DAGScheduler: running: Set()
15/01/14 22:06:46 INFO DAGScheduler: waiting: Set(Stage 2)
15/01/14 22:06:46 INFO DAGScheduler: failed: Set()
15/01/14 22:06:46 INFO DAGScheduler: Missing parents for Stage 2: List()
15/01/14 22:06:46 INFO DAGScheduler: Submitting Stage 2 (MappedRDD[8] at saveAsTextFile at SparkWordCount.scala:43), which is now runnable
15/01/14 22:06:47 INFO MemoryStore: ensureFreeSpace(112880) called with curMem=197635, maxMem=
15/01/14 22:06:47 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 110.2 KB, free 461.4 MB)
15/01/14 22:06:47 INFO MemoryStore: ensureFreeSpace(67500) called with curMem=310515, maxMem=
15/01/14 22:06:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 65.9 KB, free 461.3 MB)
15/01/14 22:06:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:46971 (size: 65.9 KB, free: 461.6 MB)
15/01/14 22:06:47 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
15/01/14 22:06:47 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:838
15/01/14 22:06:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 2 (MappedRDD[8] at saveAsTextFile at SparkWordCount.scala:43)
15/01/14 22:06:47 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
15/01/14 22:06:47 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, PROCESS_LOCAL, 1056 bytes)
15/01/14 22:06:47 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
15/01/14 22:06:47 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/01/14 22:06:47 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/01/14 22:06:47 INFO FileOutputCommitter: Saved output of task 'attempt__0002_m_' to file:/home/hadoop/spark1.2.0/WordCountResult/_temporary/0/task__0002_m_000000
15/01/14 22:06:47 INFO SparkHadoopWriter: attempt__0002_m_: Committed
15/01/14 22:06:47 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 824 bytes result sent to driver
15/01/14 22:06:47 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 397 ms on localhost (1/1)
15/01/14 22:06:47 INFO DAGScheduler: Stage 2 (saveAsTextFile at SparkWordCount.scala:43) finished in 0.399 s
15/01/14 22:06:47 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
15/01/14 22:06:47 INFO DAGScheduler: Job 0 finished: saveAsTextFile at SparkWordCount.scala:43, took 1.241181 s
15/01/14 22:06:47 INFO SparkUI: Stopped Spark web UI at http://hadoop-Inspiron-3521.local:4040
15/01/14 22:06:47 INFO DAGScheduler: Stopping DAGScheduler
15/01/14 22:06:48 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
15/01/14 22:06:48 INFO MemoryStore: MemoryStore cleared
15/01/14 22:06:48 INFO BlockManager: BlockManager stopped
15/01/14 22:06:48 INFO BlockManagerMaster: BlockManagerMaster stopped
15/01/14 22:06:48 INFO SparkContext: Successfully stopped SparkContext
15/01/14 22:06:48 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/01/14 22:06:48 INFO RemoteActorRefProvider$RemotingTerminator: Rem proceeding with flushing remote transports.
Process finished with exit code 0
Adjusting the log level
As the output above shows, Spark logs at the INFO level by default. To see all of the logging, configure Spark's log output by creating a log4j.properties file in the source root of the wordcount project with the following content:
log4j.rootCategory=DEBUG, file
log4j.appender.file=org.apache.log4j.ConsoleAppender
# To write the log to a file instead, use a FileAppender:
#log4j.appender.file=org.apache.log4j.FileAppender
#log4j.appender.file.file=spark.log
log4j.appender.file.append=false
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
# Ignore messages below warning level from Jetty, because it's a bit verbose
log4j.logger.org.eclipse.jetty=WARN
org.eclipse.jetty.LEVEL=WARN
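The same adjustment can also be made programmatically; a minimal sketch of my own (using the log4j 1.x API already on Spark's classpath), placed at the top of main() before the SparkContext is created:

import org.apache.log4j.{Level, Logger}

// Raise or lower the threshold without touching log4j.properties.
Logger.getRootLogger.setLevel(Level.DEBUG)                  // see everything
Logger.getLogger("org.eclipse.jetty").setLevel(Level.WARN)  // but keep Jetty quiet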
A problem to watch out for:
Setting up the Spark development environment on Windows fails with the following error:
15/01/17 16:17:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/01/17 16:17:04 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:43)
at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:202)
at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1784)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:105)
at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:180)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:292)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:232)
at spark.examples.SparkWordCount$.main(SparkWordCount.scala:39)
at spark.examples.SparkWordCount.main(SparkWordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Fix:
1. Unpack a Hadoop distribution and set HADOOP_HOME in the environment variables:
HADOOP_HOME=E:/devsoftware/hadoop-2.5.2/hadoop-2.5.2
If the environment variable does not take effect, add the following statement to the code instead:
System.setProperty("hadoop.home.dir", "E:/devsoftware/hadoop-2.5.2/hadoop-2.5.2");
2. Copy winutils.exe into the bin directory:
E:/devsoftware/hadoop-2.5.2/hadoop-2.5.2/bin
Download link:
http://download.csdn.net/download/zxyacb
The hadoop.dll file is not needed.
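To keep this workaround from affecting non-Windows runs, the property can be set conditionally; the following is my own sketch, reusing the example path from above:

// Only set hadoop.home.dir on Windows, and only if HADOOP_HOME is not already set.
if (System.getProperty("os.name").toLowerCase.contains("windows")
    && System.getenv("HADOOP_HOME") == null) {
  System.setProperty("hadoop.home.dir", "E:/devsoftware/hadoop-2.5.2/hadoop-2.5.2")
}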
After several days of wrestling with this, the deadlock is finally broken and the Spark source can be debugged. Next: pick up the pace and deepen my Spark skills!