WEBVTT

1
00:00:04.040 --> 00:00:06.070 A:middle L:90%
Okay, thank you. So I'm going to talk

2
00:00:06.070 --> 00:00:10.050 A:middle L:90%
about the ecosystem for the new HPC. Uh HPC

3
00:00:10.050 --> 00:00:13.160 A:middle L:90%
here is really referring to heterogeneous parallel computing rather than

4
00:00:13.539 --> 00:00:17.219 A:middle L:90%
traditional high-performance computing that you'll typically hear about

5
00:00:17.230 --> 00:00:20.280 A:middle L:90%
on campus. Um And the reason for that generally

6
00:00:20.280 --> 00:00:22.329 A:middle L:90%
is when people talk about high performance computing, they're

7
00:00:22.329 --> 00:00:25.559 A:middle L:90%
thinking about large scale supercomputing and aspects of that nature

8
00:00:25.940 --> 00:00:27.910 A:middle L:90%
and a lot of what I've done, although it

9
00:00:27.920 --> 00:00:31.280 A:middle L:90%
tends to get pigeonholed in the

10
00:00:31.280 --> 00:00:35.719 A:middle L:90%
supercomputing, high performance computing realm. Uh It actually

11
00:00:35.729 --> 00:00:39.990 A:middle L:90%
spans a much larger breadth of activities. We're doing

12
00:00:39.990 --> 00:00:43.659 A:middle L:90%
quite a bit of work with embedded devices uh this

13
00:00:43.670 --> 00:00:46.649 A:middle L:90%
particular, maybe not this one, but these particular devices

14
00:00:46.649 --> 00:00:50.659 A:middle L:90%
right now have integrated CPU and GPU cores inside

15
00:00:50.659 --> 00:00:53.179 A:middle L:90%
them. So there's a tremendous amount of parallel computing

16
00:00:53.179 --> 00:00:56.320 A:middle L:90%
capability here that needs to be tapped in order to

17
00:00:56.320 --> 00:01:00.729 A:middle L:90%
get the appropriate uh end user experience. So with

18
00:01:00.729 --> 00:01:03.299 A:middle L:90%
that, let me give you a high level overview

19
00:01:03.299 --> 00:01:06.900 A:middle L:90%
of some of the things that we're doing. So

20
00:01:06.900 --> 00:01:10.890 A:middle L:90%
I said that we're doing all work that encompasses parallel

21
00:01:10.890 --> 00:01:15.519 A:middle L:90%
computing from system software to middleware to applications tools and

22
00:01:15.519 --> 00:01:19.340 A:middle L:90%
libraries and um you'll see a number of the projects,

23
00:01:19.340 --> 00:01:22.250 A:middle L:90%
although it's probably not coming up in the greatest

24
00:01:22.260 --> 00:01:26.030 A:middle L:90%
it's a little bit faded out, but you can see systems,

25
00:01:26.030 --> 00:01:30.540 A:middle L:90%
networking, green computing, and renaissance computing, kind of

26
00:01:30.549 --> 00:01:33.700 A:middle L:90%
the rebirth of computing and how it supports all the

27
00:01:33.700 --> 00:01:38.920 A:middle L:90%
myriad aspects of our life, from typical scientific

28
00:01:38.920 --> 00:01:42.049 A:middle L:90%
computing to what we do from day to day with

29
00:01:42.049 --> 00:01:45.329 A:middle L:90%
our smartphones uh and the like, um and we

30
00:01:45.329 --> 00:01:48.060 A:middle L:90%
do it all the way from something as small as

31
00:01:48.060 --> 00:01:49.900 A:middle L:90%
a mobile phone, or in this case particularly an iPad

32
00:01:49.900 --> 00:01:52.980 A:middle L:90%
2. Things that you've probably heard a little bit more

33
00:01:52.980 --> 00:01:55.640 A:middle L:90%
about uh in terms of what we've been doing and

34
00:01:55.640 --> 00:01:59.560 A:middle L:90%
that is, with our HokieSpeed supercomputer, which debuted

35
00:01:59.560 --> 00:02:02.939 A:middle L:90%
as the most energy-efficient supercomputer, the most energy-efficient

36
00:02:02.939 --> 00:02:07.909 A:middle L:90%
commodity supercomputer in the US last year, well, a

37
00:02:07.909 --> 00:02:12.780 A:middle L:90%
little over a year ago, in November 2011.

38
00:02:12.789 --> 00:02:15.219 A:middle L:90%
to give you a little bit of a perspective on

39
00:02:15.229 --> 00:02:15.419 A:middle L:90%
some of the work that we're doing, I'm

40
00:02:15.419 --> 00:02:21.969 A:middle L:90%
gonna focus primarily today on a broader scope project in

41
00:02:21.969 --> 00:02:25.069 A:middle L:90%
the area of heterogeneous parallel computing with a bit of

42
00:02:25.069 --> 00:02:29.550 A:middle L:90%
a focus on graphics processing units. Um That's not

43
00:02:29.550 --> 00:02:31.500 A:middle L:90%
where all of our work resides. Uh You can

44
00:02:31.500 --> 00:02:36.789 A:middle L:90%
see this synergistic diagram here. We've

45
00:02:36.789 --> 00:02:40.000 A:middle L:90%
been doing work with cloud computing, in particular

46
00:02:40.000 --> 00:02:44.159 A:middle L:90%
making use of the Microsoft Azure cloud as part of

47
00:02:44.159 --> 00:02:47.069 A:middle L:90%
the NSF-Microsoft grant, on how you can leverage not

48
00:02:47.069 --> 00:02:50.219 A:middle L:90%
just the cloud but the combination of the client and

49
00:02:50.219 --> 00:02:53.560 A:middle L:90%
the cloud to affect uh the answers that you get

50
00:02:53.569 --> 00:02:57.189 A:middle L:90%
and the speed at which you get them. We've

51
00:02:57.189 --> 00:03:00.830 A:middle L:90%
also been doing extensive work in

52
00:03:00.830 --> 00:03:02.439 A:middle L:90%
the green computing space. A lot of work on

53
00:03:02.439 --> 00:03:07.599 A:middle L:90%
energy-proportional data centers. This has been

54
00:03:07.599 --> 00:03:09.949 A:middle L:90%
of particular interest to Google of late. I don't

55
00:03:09.949 --> 00:03:13.930 A:middle L:90%
see my PhD student here, but we've had several

56
00:03:13.930 --> 00:03:15.509 A:middle L:90%
overtures from Google in terms of adapting what we've been

57
00:03:15.509 --> 00:03:19.750 A:middle L:90%
doing in the green computing space for their data centers

58
00:03:19.750 --> 00:03:22.789 A:middle L:90%
in terms of energy proportionality. And what that typically

59
00:03:22.789 --> 00:03:27.699 A:middle L:90%
means is being able to only expend as much energy

60
00:03:27.710 --> 00:03:32.210 A:middle L:90%
as the workload you impart on the processor. So

61
00:03:32.210 --> 00:03:35.539 A:middle L:90%
if you actually did a graph of the amount

62
00:03:35.539 --> 00:03:38.849 A:middle L:90%
of energy you consume with a very low amount of workload

63
00:03:38.860 --> 00:03:43.580 A:middle L:90%
, you'll find that the idle power consumption, just

64
00:03:43.590 --> 00:03:47.659 A:middle L:90%
leaving the computer on, is pretty high. Take,

65
00:03:47.659 --> 00:03:51.930 A:middle L:90%
like, your desktop on and running. It's on

66
00:03:51.930 --> 00:03:54.360 A:middle L:90%
the order of about 150 watts. That's just burning

67
00:03:54.370 --> 00:03:57.960 A:middle L:90%
power even though the workload on it is

68
00:03:57.960 --> 00:04:00.240 A:middle L:90%
zero. So if the workload on it is zero,

69
00:04:00.240 --> 00:04:01.689 A:middle L:90%
you'd like the power consumption, or energy consumption, to be

70
00:04:01.689 --> 00:04:04.740 A:middle L:90%
zero; you want to be energy proportional.

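NOTE
Editor's aside, not part of the spoken talk: a minimal sketch of the energy-proportionality idea described above. The 150 W idle figure comes from the talk; the 250 W peak and the linear load model are assumptions for illustration only.
```python
IDLE_W, PEAK_W = 150.0, 250.0          # idle draw from the talk; peak is assumed
def measured_power(load):
    """Power draw of a typical desktop at a given load in [0, 1]."""
    return IDLE_W + (PEAK_W - IDLE_W) * load
def proportional_power(load):
    """The energy-proportional ideal: zero workload costs zero power."""
    return PEAK_W * load
print(measured_power(0.0))       # 150.0 -- watts burned at zero workload
print(proportional_power(0.0))   # 0.0 -- the energy-proportional goal
```
At full load the two curves meet; the gap at low load is what energy-proportional designs try to close.
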
71
00:04:04.750 --> 00:04:06.120 A:middle L:90%
And so those are some of the things that

72
00:04:06.120 --> 00:04:10.740 A:middle L:90%
are of interest to Google. We're also, let's see,

73
00:04:10.750 --> 00:04:13.259 A:middle L:90%
other things. We've recently done some work with

74
00:04:13.639 --> 00:04:17.579 A:middle L:90%
Professor Yao in cybersecurity, in terms of leveraging heterogeneous

75
00:04:17.579 --> 00:04:24.949 A:middle L:90%
computing for cybersecurity aspects, and I

76
00:04:24.949 --> 00:04:27.089 A:middle L:90%
won't be able to get into that very much.

77
00:04:27.100 --> 00:04:30.029 A:middle L:90%
And in some sense, what started it all:

78
00:04:30.029 --> 00:04:30.899 A:middle L:90%
some of the work with Professor Alexey Onufriev

79
00:04:30.899 --> 00:04:35.500 A:middle L:90%
, where we are taking his molecular dynamics code and

80
00:04:35.500 --> 00:04:40.100 A:middle L:90%
mapping it onto these emerging heterogeneous computing environments in order

81
00:04:40.100 --> 00:04:46.060 A:middle L:90%
to accelerate the discovery of binding sites for molecules.

82
00:04:48.339 --> 00:04:51.060 A:middle L:90%
So the short of it is that

83
00:04:51.069 --> 00:04:55.000 A:middle L:90%
we're encompassing all aspects of parallel computing from the embedded

84
00:04:55.000 --> 00:04:57.709 A:middle L:90%
space to the high performance computing space. We want

85
00:04:57.709 --> 00:05:00.860 A:middle L:90%
to provide scientists and engineers with scalable and efficient computational

86
00:05:00.860 --> 00:05:04.230 A:middle L:90%
tools and system software that enable them to concentrate on

87
00:05:04.230 --> 00:05:08.149 A:middle L:90%
their science and engineering rather than on the computer science

88
00:05:08.149 --> 00:05:11.629 A:middle L:90%
and engineering. And a lot of the work that

89
00:05:11.629 --> 00:05:14.160 A:middle L:90%
I'll be presenting today has been contributed

90
00:05:14.160 --> 00:05:16.509 A:middle L:90%
by many, many graduate students and postdocs as well

91
00:05:16.509 --> 00:05:18.959 A:middle L:90%
as myself. And so I just want to recognize

92
00:05:18.959 --> 00:05:20.949 A:middle L:90%
these people here rather than at the end of the

93
00:05:20.949 --> 00:05:24.500 A:middle L:90%
talk when I probably just throw the slide up there

94
00:05:24.500 --> 00:05:26.980 A:middle L:90%
really fast and not get a chance to really properly

95
00:05:26.980 --> 00:05:30.939 A:middle L:90%
acknowledge the folks that are responsible for and contributing

96
00:05:30.939 --> 00:05:34.480 A:middle L:90%
to this larger endeavor that we have in terms of

97
00:05:34.480 --> 00:05:39.720 A:middle L:90%
heterogeneous parallel computing. So let me just start out

98
00:05:39.720 --> 00:05:46.339 A:middle L:90%
with some quick background here. Um uh This is

99
00:05:46.339 --> 00:05:48.949 A:middle L:90%
something that occurred about a decade ago. Um It

100
00:05:48.949 --> 00:05:54.079 A:middle L:90%
was dubbed Computenik because the Japanese had debuted a

101
00:05:54.079 --> 00:05:57.839 A:middle L:90%
supercomputer that was five times faster than anything else in

102
00:05:57.839 --> 00:06:00.329 A:middle L:90%
the world. And then if you added the next

103
00:06:00.339 --> 00:06:03.329 A:middle L:90%
20 supercomputers in the world, they would equal this

104
00:06:03.339 --> 00:06:08.810 A:middle L:90%
single supercomputer. And so this event got dubbed

105
00:06:08.810 --> 00:06:10.779 A:middle L:90%
Computenik. It's a play on words with

106
00:06:10.779 --> 00:06:15.759 A:middle L:90%
respect to the Sputnik event, and it led to

107
00:06:15.540 --> 00:06:17.759 A:middle L:90%
the study that was put out by the U.

108
00:06:17.759 --> 00:06:21.339 A:middle L:90%
S. Council on Competitiveness that was funded by NSF

109
00:06:21.339 --> 00:06:25.120 A:middle L:90%
and the Department of Energy. And what is interesting

110
00:06:25.120 --> 00:06:27.889 A:middle L:90%
to see here. I mean this really stunned me

111
00:06:28.139 --> 00:06:31.420 A:middle L:90%
when it first came out, is that this is a

112
00:06:31.420 --> 00:06:36.730 A:middle L:90%
sampling of over 200 companies in the US, and not

113
00:06:36.730 --> 00:06:43.209 A:middle L:90%
just computer companies but a whole swath of companies, including

114
00:06:43.209 --> 00:06:46.829 A:middle L:90%
Fortune 500 companies like Procter and Gamble. Where do

115
00:06:46.829 --> 00:06:51.430 A:middle L:90%
you find Procter and Gamble? You're like, what's Procter

116
00:06:51.430 --> 00:07:01.449 A:middle L:90%
and Gamble? P&G, what's that? They're consumer products:

117
00:07:01.459 --> 00:07:04.220 A:middle L:90%
you walk into a grocery store, probably about

118
00:07:04.220 --> 00:07:08.009 A:middle L:90%
half the products in there are either made by or

119
00:07:08.009 --> 00:07:12.560 A:middle L:90%
distributed by Procter and Gamble, Cincinnati, Ohio. Okay.

120
00:07:13.439 --> 00:07:15.000 A:middle L:90%
And they're one of the many that said that they

121
00:07:15.000 --> 00:07:18.800 A:middle L:90%
could not exist or compete unless they had high performance

122
00:07:18.800 --> 00:07:21.329 A:middle L:90%
computing. So if you look at this chart, there's

123
00:07:21.329 --> 00:07:27.149 A:middle L:90%
only 3% that said they could exist and compete without

124
00:07:27.639 --> 00:07:32.019 A:middle L:90%
high performance computing. And the other 97% said that

125
00:07:32.029 --> 00:07:34.990 A:middle L:90%
they needed it to compete. And the one example

126
00:07:34.990 --> 00:07:36.610 A:middle L:90%
I don't have it, I don't have a picture

127
00:07:36.610 --> 00:07:39.360 A:middle L:90%
of it here. But one of the examples that

128
00:07:39.370 --> 00:07:44.579 A:middle L:90%
is always amusing to talk about is that they do

129
00:07:44.579 --> 00:07:47.689 A:middle L:90%
computational fluid dynamics on the way that Pringles potato chips

130
00:07:47.689 --> 00:07:49.629 A:middle L:90%
fly off the conveyor belt and how they stack them

131
00:07:49.629 --> 00:07:53.490 A:middle L:90%
inside the potato chip can. I wonder, how did

132
00:07:53.490 --> 00:07:54.980 A:middle L:90%
they get it in so perfectly? Do they have

133
00:07:54.980 --> 00:07:58.410 A:middle L:90%
a human stacking them? No, of course not. They

134
00:07:58.420 --> 00:08:00.339 A:middle L:90%
want to automate this process and they do this through

135
00:08:00.350 --> 00:08:03.560 A:middle L:90%
high performance computing in this case. So

136
00:08:03.560 --> 00:08:07.720 A:middle L:90%
more recently this came out, so we had the

137
00:08:07.720 --> 00:08:11.389 A:middle L:90%
Computenik one, which was the Japanese coming out

138
00:08:11.399 --> 00:08:16.629 A:middle L:90%
and smashing US supremacy in supercomputing. And similarly,

139
00:08:16.629 --> 00:08:20.009 A:middle L:90%
something occurred recently, called Tianhe-1A.

140
00:08:20.019 --> 00:08:22.680 A:middle L:90%
People viewed this as the second coming of Computenik

141
00:08:22.689 --> 00:08:24.600 A:middle L:90%
. But I didn't really view it that way

142
00:08:24.600 --> 00:08:26.459 A:middle L:90%
. It was only 43% faster

143
00:08:26.540 --> 00:08:30.740 A:middle L:90%
than the previous number-one supercomputer. But the reason

144
00:08:30.740 --> 00:08:31.750 A:middle L:90%
why it really garnered a lot of attention was it

145
00:08:31.750 --> 00:08:41.379 A:middle L:90%
was $20 million and 42% less power. And so what

146
00:08:41.379 --> 00:08:41.870 A:middle L:90%
it really is: this is really what I call

147
00:08:41.870 --> 00:08:45.029 A:middle L:90%
the second coming of the Beowulf cluster, which

148
00:08:45.029 --> 00:08:48.779 A:middle L:90%
is a further commoditization of high performance computing. It's

149
00:08:48.779 --> 00:08:50.879 A:middle L:90%
commoditizing it by using commodity parts that you can

150
00:08:50.879 --> 00:08:52.820 A:middle L:90%
largely buy off the shelf. You go up to

151
00:08:52.820 --> 00:08:54.889 A:middle L:90%
your Best Buy and you take those parts, you

152
00:08:54.889 --> 00:08:58.860 A:middle L:90%
integrate them with other parts, and then you put glue

153
00:08:58.100 --> 00:09:05.179 A:middle L:90%
software that integrates this into a singular,

154
00:09:05.179 --> 00:09:07.779 A:middle L:90%
cohesive, integrated computing environment, rather than what would

155
00:09:07.789 --> 00:09:11.500 A:middle L:90%
otherwise look like a gaggle of separate PCs or servers.

156
00:09:11.639 --> 00:09:13.230 A:middle L:90%
Okay, so what do I mean by a

157
00:09:13.230 --> 00:09:16.169 A:middle L:90%
Beowulf cluster? Well, the first coming of

158
00:09:16.169 --> 00:09:18.929 A:middle L:90%
the Beowulf cluster: the idea is that back

159
00:09:18.929 --> 00:09:20.929 A:middle L:90%
in the nineties, the only people that could run

160
00:09:20.929 --> 00:09:24.409 A:middle L:90%
on supercomputers were those few privileged people that had access

161
00:09:24.409 --> 00:09:28.200 A:middle L:90%
to the so-called big-iron supercomputing environments. And

162
00:09:28.200 --> 00:09:31.690 A:middle L:90%
so what folks at NASA Goddard did is they said,

163
00:09:31.690 --> 00:09:33.029 A:middle L:90%
well, we can build our own supercomputer. We

164
00:09:33.029 --> 00:09:37.039 A:middle L:90%
can, with the confluence of Linux being available, which

165
00:09:37.039 --> 00:09:39.490 A:middle L:90%
is an open-source operating system, with PCs that were

166
00:09:39.490 --> 00:09:43.049 A:middle L:90%
getting powerful enough for real computing. People would

167
00:09:43.049 --> 00:09:46.019 A:middle L:90%
just buy PC towers. This is a picture from

168
00:09:46.019 --> 00:09:48.919 A:middle L:90%
1994, by the way. They would gaggle them

169
00:09:48.919 --> 00:09:52.220 A:middle L:90%
together and then they would create their own supercomputer and

170
00:09:52.220 --> 00:09:54.019 A:middle L:90%
so now what you're seeing is they're leveraging this.

171
00:09:54.029 --> 00:09:58.129 A:middle L:90%
Back then, they leveraged commodity components, from the operating system

172
00:09:58.129 --> 00:10:01.860 A:middle L:90%
to typical hardware that you can buy from the store

173
00:10:01.340 --> 00:10:03.629 A:middle L:90%
. Now the second coming of it

174
00:10:03.629 --> 00:10:07.639 A:middle L:90%
is that, in addition to leveraging these PCs, we're

175
00:10:07.639 --> 00:10:09.620 A:middle L:90%
going to leverage something inside of this called the graphics

176
00:10:09.620 --> 00:10:15.809 A:middle L:90%
processing unit, to accelerate computation. And the graphics processing unit

177
00:10:15.809 --> 00:10:18.500 A:middle L:90%
is what's driving every one of the displays that is open

178
00:10:18.509 --> 00:10:22.809 A:middle L:90%
right now on your laptops. Typically your display is 1920

179
00:10:22.809 --> 00:10:26.519 A:middle L:90%
by 1080 pixels; that's about two million pixels that have

180
00:10:26.519 --> 00:10:31.470 A:middle L:90%
to get updated every 30 milliseconds. That's a tremendous

181
00:10:31.470 --> 00:10:35.190 A:middle L:90%
amount of computing capability in terms of doing a very

182
00:10:35.190 --> 00:10:37.570 A:middle L:90%
specific task. And that's updating the pixels on your

183
00:10:37.570 --> 00:10:41.070 A:middle L:90%
display. So then the question is, well, can

184
00:10:41.070 --> 00:10:46.960 A:middle L:90%
you not then leverage that computing capability, that simple

185
00:10:46.960 --> 00:10:50.059 A:middle L:90%
computing capability and scale it to do millions of computations

186
00:10:50.059 --> 00:10:54.070 A:middle L:90%
on something else, rather than just doing the pixels

187
00:10:54.070 --> 00:10:56.740 A:middle L:90%
on your display and that's where the notion of graphics

188
00:10:56.740 --> 00:11:00.429 A:middle L:90%
processing for computing comes about. So you can see

189
00:11:00.429 --> 00:11:03.759 A:middle L:90%
that this is just a visual display that I borrowed

190
00:11:03.759 --> 00:11:05.529 A:middle L:90%
from Google, and the idea is that you can,

191
00:11:05.539 --> 00:11:07.879 A:middle L:90%
instead of just updating the pixels on the display for

192
00:11:07.879 --> 00:11:11.309 A:middle L:90%
that picture, you can start using it for

193
00:11:11.309 --> 00:11:15.929 A:middle L:90%
real computation. And so um we have projects that

194
00:11:15.929 --> 00:11:18.990 A:middle L:90%
are related to all of these areas right now ongoing

195
00:11:18.000 --> 00:11:22.779 A:middle L:90%
in support of the uh the sciences and engineering that

196
00:11:22.779 --> 00:11:26.129 A:middle L:90%
are running on top of GPUs. So this

197
00:11:26.129 --> 00:11:30.769 A:middle L:90%
is an example of some of the speedups

198
00:11:30.769 --> 00:11:33.669 A:middle L:90%
that we're getting and this is generally with respect to

199
00:11:33.669 --> 00:11:35.450 A:middle L:90%
a serial CPU. Okay, so if you happen

200
00:11:35.450 --> 00:11:39.159 A:middle L:90%
to run it on a quad-core CPU, that is,

201
00:11:39.159 --> 00:11:41.789 A:middle L:90%
four CPU cores, then of course you have to divide

202
00:11:41.789 --> 00:11:43.759 A:middle L:90%
all of those speedups by a factor of four.

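NOTE
Editor's aside, not part of the spoken talk: the speedups above are relative to one serial CPU core, so, as the speaker notes, comparing against a quad-core baseline means dividing by four. A small illustrative helper; the perfect-scaling assumption is ours, not the talk's.
```python
def speedup_vs_multicore(speedup_vs_serial, cpu_cores=4, efficiency=1.0):
    """Rescale a GPU speedup measured against one CPU core to a multicore baseline.
    Assumes the CPU code scales across cores at the given parallel efficiency."""
    return speedup_vs_serial / (cpu_cores * efficiency)
print(speedup_vs_multicore(146))   # 36.5 -- the 146-fold speedup vs. an ideal quad-core
```
Real CPU codes rarely scale perfectly, so the effective divisor is usually smaller than the core count.
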
203
00:11:46.039 --> 00:11:48.929 A:middle L:90%
So you can see that the speedup

204
00:11:48.929 --> 00:11:52.970 A:middle L:90%
varies tremendously. And this is where you have this

205
00:11:52.970 --> 00:11:58.440 A:middle L:90%
confluence of underlying architecture married with the system software and

206
00:11:58.440 --> 00:12:01.690 A:middle L:90%
application software and how that maps onto the underlying hardware

207
00:12:03.070 --> 00:12:05.220 A:middle L:90%
. So from an algorithmic perspective, you have things

208
00:12:05.220 --> 00:12:09.399 A:middle L:90%
that are very regular in computation, like the display

209
00:12:09.399 --> 00:12:11.759 A:middle L:90%
of pixels on a laptop. That's very regular,

210
00:12:13.139 --> 00:12:15.879 A:middle L:90%
all of the pixels are doing

211
00:12:15.879 --> 00:12:18.809 A:middle L:90%
the same thing. Or actually, all of the processing

212
00:12:18.809 --> 00:12:20.740 A:middle L:90%
cores that are generating the pixel color are doing the

213
00:12:20.740 --> 00:12:24.019 A:middle L:90%
same thing, they're calculating the color. Okay,

214
00:12:24.029 --> 00:12:26.860 A:middle L:90%
So if you have computations that are the same,

215
00:12:26.870 --> 00:12:28.360 A:middle L:90%
let's say you have this huge array and every

216
00:12:28.740 --> 00:12:31.700 A:middle L:90%
array element is doing the same thing. Okay,

217
00:12:31.700 --> 00:12:35.759 A:middle L:90%
so it's single instruction, multiple data. You have

218
00:12:35.759 --> 00:12:37.169 A:middle L:90%
a single instruction that says, okay, calculate this on

219
00:12:37.179 --> 00:12:41.590 A:middle L:90%
every single one of the array elements that you have.

220
00:12:41.840 --> 00:12:45.110 A:middle L:90%
That's a very regular computation, and that's how you

221
00:12:45.110 --> 00:12:48.470 A:middle L:90%
generally get much higher speedups, as noted, like in

222
00:12:48.470 --> 00:12:50.960 A:middle L:90%
the upper left, where we have a 146-fold speedup, versus something

223
00:12:50.960 --> 00:12:52.669 A:middle L:90%
that is more irregular. So if you

224
00:12:52.669 --> 00:12:56.830 A:middle L:90%
think about things like a graph traversal or graph search

225
00:12:56.830 --> 00:12:58.850 A:middle L:90%
like the one Facebook has come out with. That's a

226
00:12:58.850 --> 00:13:03.000 A:middle L:90%
very irregular computation. You go along different nodes of

227
00:13:03.000 --> 00:13:05.610 A:middle L:90%
your graph, and some portions of the graph will

228
00:13:05.610 --> 00:13:09.690 A:middle L:90%
be much more computationally intensive in terms of the way

229
00:13:09.690 --> 00:13:11.629 A:middle L:90%
it's connected. Other parts are less computationally intensive.

230
00:13:11.629 --> 00:13:15.059 A:middle L:90%
And so you start to have an unbalanced workload.

231
00:13:15.139 --> 00:13:16.730 A:middle L:90%
It's very irregular. That doesn't allow you to exploit

232
00:13:16.730 --> 00:13:20.470 A:middle L:90%
the capabilities of the graphics processing unit to its fullest.

233
00:13:20.480 --> 00:13:22.110 A:middle L:90%
And so you'll get lesser speedups, down in

234
00:13:22.110 --> 00:13:24.759 A:middle L:90%
the 17-fold or 18-fold range, for example, there.

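NOTE
Editor's aside, not part of the spoken talk: a toy contrast between the regular, single-instruction-multiple-data work described above and irregular graph work, where per-node cost varies. The tiny array and adjacency list are made up for illustration.
```python
data = list(range(8))
squared = [x * x for x in data]    # regular: the same operation on every element
graph = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2]}   # hypothetical adjacency list
work_per_node = {n: len(nbrs) for n, nbrs in graph.items()}   # irregular: uneven cost
print(squared)
print(work_per_node)               # unequal values -> unbalanced workload on a GPU
```
The first pattern keeps every GPU core busy doing identical work; the second leaves some cores idle while others churn, which is why the measured speedups drop.
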
235
00:13:28.240 --> 00:13:31.769 A:middle L:90%
So a way to view this is that this is my

236
00:13:31.779 --> 00:13:35.440 A:middle L:90%
way of, I don't want to call

237
00:13:35.440 --> 00:13:37.370 A:middle L:90%
it dumbing it down but just bringing it down to

238
00:13:37.370 --> 00:13:39.509 A:middle L:90%
a level that everyone can understand, even my own parents. It is

239
00:13:39.509 --> 00:13:43.240 A:middle L:90%
that you kind of view the CPU as your sport

240
00:13:43.240 --> 00:13:46.909 A:middle L:90%
utility vehicle: it does everything well but nothing expertly

241
00:13:46.909 --> 00:13:50.700 A:middle L:90%
well. So it can turn left and right,

242
00:13:50.700 --> 00:13:52.559 A:middle L:90%
, it can go pretty fast on the highway.

243
00:13:52.570 --> 00:13:54.940 A:middle L:90%
But if you wanted to corner around a

244
00:13:54.940 --> 00:13:58.970 A:middle L:90%
90° turn at 150 miles an hour, it's probably

245
00:13:58.970 --> 00:14:01.679 A:middle L:90%
not a good idea, because you're probably going to roll the

246
00:14:01.690 --> 00:14:03.269 A:middle L:90%
sport utility vehicle over. So it does things well

247
00:14:03.279 --> 00:14:07.450 A:middle L:90%
, generally. The GPU is very specialized; it

248
00:14:07.450 --> 00:14:11.409 A:middle L:90%
does things very specifically well, and it is, in

249
00:14:11.409 --> 00:14:13.360 A:middle L:90%
this case, the drag car: it goes straight.

250
00:14:15.139 --> 00:14:18.600 A:middle L:90%
If you reach an intersection, a decision

251
00:14:18.600 --> 00:14:20.149 A:middle L:90%
point, and you say, well, if there's traffic in front

252
00:14:20.149 --> 00:14:22.100 A:middle L:90%
of me then I want to turn left. It

253
00:14:22.100 --> 00:14:24.370 A:middle L:90%
doesn't do that well. If you think of a

254
00:14:24.370 --> 00:14:26.789 A:middle L:90%
drag car racer: it goes 400 miles an hour, or however

255
00:14:26.789 --> 00:14:28.919 A:middle L:90%
fast it goes, but it isn't just able to

256
00:14:28.919 --> 00:14:31.350 A:middle L:90%
turn on a dime; it also doesn't have brakes.

257
00:14:31.840 --> 00:14:35.110 A:middle L:90%
So it doesn't have preemption. It can't preempt kernels

258
00:14:35.120 --> 00:14:37.139 A:middle L:90%
or execution kernels, which is the same thing. And

259
00:14:37.139 --> 00:14:39.720 A:middle L:90%
what a GPU does is you have to execute the

260
00:14:39.730 --> 00:14:43.230 A:middle L:90%
code until completion. You can't just preempt it in

261
00:14:43.230 --> 00:14:50.080 A:middle L:90%
the middle and start up another job. So all

262
00:14:50.080 --> 00:14:52.549 A:middle L:90%
of this, you can kind of view

263
00:14:52.549 --> 00:14:54.090 A:middle L:90%
this as a metaphor: the CPU might be your

264
00:14:54.090 --> 00:14:56.289 A:middle L:90%
left brain, the GPU might be your right brain. The idea is

265
00:14:56.289 --> 00:15:01.490 A:middle L:90%
that you no longer have a

266
00:15:01.490 --> 00:15:05.320 A:middle L:90%
single type of brain, or

267
00:15:05.330 --> 00:15:07.879 A:middle L:90%
multiple instances of a single type of brain,

268
00:15:07.889 --> 00:15:09.899 A:middle L:90%
but in fact two different types of

269
00:15:09.899 --> 00:15:13.350 A:middle L:90%
brains, and you need to be able to figure out

270
00:15:13.940 --> 00:15:18.980 A:middle L:90%
which processor or tool is appropriate for the

271
00:15:18.980 --> 00:15:22.279 A:middle L:90%
task that you have at hand. The idea of

272
00:15:22.279 --> 00:15:24.519 A:middle L:90%
what we're trying to do in our work is that

273
00:15:24.529 --> 00:15:28.009 A:middle L:90%
we want to alleviate the burden on you as an

274
00:15:28.009 --> 00:15:33.559 A:middle L:90%
end user of having to manually do that mapping and

275
00:15:33.559 --> 00:15:39.330 A:middle L:90%
that optimization. So this is what we're calling

276
00:15:39.330 --> 00:15:41.360 A:middle L:90%
heterogeneous parallel computing. And again, I'll

277
00:15:41.740 --> 00:15:46.509 A:middle L:90%
very briefly show you, this is

278
00:15:46.519 --> 00:15:52.090 A:middle L:90%
something that we have in terms of HokieSpeed,

279
00:15:52.100 --> 00:15:56.559 A:middle L:90%
the GPU-accelerated supercomputer for the masses,

280
00:15:56.039 --> 00:16:02.220 A:middle L:90%
and it's 209 nodes where each node consists of a

281
00:16:02.230 --> 00:16:06.029 A:middle L:90%
traditional CPU motherboard. So this is where the

282
00:16:06.029 --> 00:16:07.330 A:middle L:90%
CPU resides; the two yellow blobs are where the

283
00:16:07.330 --> 00:16:11.320 A:middle L:90%
CPUs get dropped in. There were two 6-core

284
00:16:11.330 --> 00:16:15.669 A:middle L:90%
CPUs, so there's 12 cores of computing capability on

285
00:16:15.669 --> 00:16:21.350 A:middle L:90%
that. And that is then supplemented by GPUs:

286
00:16:21.379 --> 00:16:25.159 A:middle L:90%
two NVIDIA Tesla Fermi GPU cards.

287
00:16:26.539 --> 00:16:27.190 A:middle L:90%
so I'm going to break the flow here just for

288
00:16:27.190 --> 00:16:29.940 A:middle L:90%
a little bit, for a moment. It's

289
00:16:29.940 --> 00:16:33.840 A:middle L:90%
like this: I'm going to talk very briefly about the

290
00:16:33.840 --> 00:16:37.179 A:middle L:90%
opportunity for folks to take a

291
00:16:37.179 --> 00:16:41.659 A:middle L:90%
quick tour of this facility. On March 1st or

292
00:16:41.659 --> 00:16:45.970 A:middle L:90%
2nd, we have graduate preview weekend coming up.

293
00:16:45.340 --> 00:16:49.519 A:middle L:90%
And I need participants, in terms of either

294
00:16:49.529 --> 00:16:56.350 A:middle L:90%
providing posters or serving on panels to answer questions

295
00:16:56.350 --> 00:17:00.610 A:middle L:90%
for incoming prospective graduate students, to tell them what

296
00:17:00.620 --> 00:17:03.970 A:middle L:90%
life is like here, who's doing research on what

297
00:17:03.970 --> 00:17:06.819 A:middle L:90%
and what have you, as well as just dealing

298
00:17:06.819 --> 00:17:08.109 A:middle L:90%
with logistics. We have a need for

299
00:17:08.109 --> 00:17:11.549 A:middle L:90%
drivers to get back and forth between on campus and

300
00:17:11.549 --> 00:17:15.009 A:middle L:90%
off campus for these prospective students. We

301
00:17:15.009 --> 00:17:18.460 A:middle L:90%
will also be taking them to dinner and things of

302
00:17:18.460 --> 00:17:27.130 A:middle L:90%
that nature. Yes, Barbara?

303
00:17:27.130 --> 00:17:37.269 A:middle L:90%
[Inaudible audience question.]

304
00:17:51.140 --> 00:17:56.299 A:middle L:90%
Yeah. So what I'm going to do is

305
00:17:56.309 --> 00:17:57.869 A:middle L:90%
I'm going to send out a sign-up sheet;

306
00:17:57.880 --> 00:18:00.440 A:middle L:90%
it just has name, email, and then what

307
00:18:00.440 --> 00:18:04.329 A:middle L:90%
your contribution might be for the graduate preview weekend on

308
00:18:04.329 --> 00:18:07.900 A:middle L:90%
March 1st and 2nd, which I should put here.

309
00:18:07.910 --> 00:18:11.869 A:middle L:90%
This is voluntary, I'm not going to force

310
00:18:11.869 --> 00:18:12.599 A:middle L:90%
you all, but you know, this is

311
00:18:12.609 --> 00:18:15.880 A:middle L:90%
giving back to our community. So the

312
00:18:15.890 --> 00:18:22.250 A:middle L:90%
contribution will be either a poster, a panel, or logistics, or

313
00:18:22.250 --> 00:18:23.859 A:middle L:90%
any combination of those or all of them. Okay

314
00:18:26.539 --> 00:18:30.190 A:middle L:90%
And this will be on March 1st, two

315
00:18:30.190 --> 00:18:36.220 A:middle L:90%
weeks from today. Okay, so back to our

316
00:18:36.230 --> 00:18:48.329 A:middle L:90%
original programming. So what we did

317
00:18:48.329 --> 00:18:49.640 A:middle L:90%
is we're trying to look at this notion of trying

318
00:18:49.640 --> 00:18:55.180 A:middle L:90%
to commoditize supercomputing for the masses, and really

319
00:18:55.190 --> 00:18:56.319 A:middle L:90%
that's actually a little bit of a red herring.

320
00:18:56.319 --> 00:18:59.470 A:middle L:90%
What I just said is that we're not necessarily doing

321
00:18:59.470 --> 00:19:00.660 A:middle L:90%
supercomputing. A lot of the work that we're doing

322
00:19:00.660 --> 00:19:03.269 A:middle L:90%
in the embedded space is really just trying to extract

323
00:19:03.640 --> 00:19:06.109 A:middle L:90%
all the performance that we can get out of a

324
00:19:06.109 --> 00:19:08.089 A:middle L:90%
parallel computing environment, whether it be a traditional supercomputer

325
00:19:08.099 --> 00:19:11.819 A:middle L:90%
all the way down to a desktop and even smartphone

326
00:19:11.829 --> 00:19:15.460 A:middle L:90%
or mobile device. And so the whole

327
00:19:18.539 --> 00:19:19.559 A:middle L:90%
uh this is, I'm going to skip over that

328
00:19:19.569 --> 00:19:22.509 A:middle L:90%
. So the whole idea is we want to do

329
00:19:22.509 --> 00:19:26.029 A:middle L:90%
a software ecosystem that supports heterogeneous parallel computing, exploits intra-

330
00:19:26.029 --> 00:19:29.990 A:middle L:90%
node parallelism, and then commoditizes this for everyone to

331
00:19:29.990 --> 00:19:32.990 A:middle L:90%
use. So it's not so difficult that you have

332
00:19:32.990 --> 00:19:37.960 A:middle L:90%
to become a PhD in power-aware and heterogeneous computing

333
00:19:37.960 --> 00:19:41.299 A:middle L:90%
in order to enable the

334
00:19:41.309 --> 00:19:45.250 A:middle L:90%
capability of extracting that kind of performance to support your

335
00:19:45.250 --> 00:19:48.710 A:middle L:90%
work and it can span a number of places.

336
00:19:48.720 --> 00:19:49.950 A:middle L:90%
I mean like I said, we've worked with Professor

337
00:19:49.950 --> 00:19:53.730 A:middle L:90%
Yao on cybersecurity. There's some serious scaling problems there

338
00:19:53.730 --> 00:19:57.430 A:middle L:90%
in terms of tackling intrusion detection and leaks and

339
00:19:57.430 --> 00:20:00.549 A:middle L:90%
things like that. There's things in the in the

340
00:20:00.549 --> 00:20:03.789 A:middle L:90%
life sciences like next generation sequencing, the amount of

341
00:20:03.789 --> 00:20:07.029 A:middle L:90%
data that's being generated is doubling at a much faster

342
00:20:07.029 --> 00:20:10.890 A:middle L:90%
rate than our computing capability. So depending on what

343
00:20:10.890 --> 00:20:12.859 A:middle L:90%
you look at, the amount of data in next-generation

344
00:20:12.859 --> 00:20:18.400 A:middle L:90%
sequencing (that is, DNA sequencing) is doubling every nine months, and

345
00:20:18.410 --> 00:20:21.920 A:middle L:90%
our computing capability is only doubling every 18 months. So

346
00:20:21.920 --> 00:20:23.420 A:middle L:90%
we can't just throw hardware at the problem and expect

347
00:20:23.420 --> 00:20:26.920 A:middle L:90%
it to be able to bridge that.

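The gap described here can be made concrete with a little arithmetic, using the two doubling rates quoted in the talk (nine months for sequencing data, eighteen months for compute); the six-year horizon below is just an illustrative choice:

```python
# Illustration of the claim above: sequencing data doubles every 9 months,
# compute capability every 18 months, so hardware alone cannot close the gap.

def growth(months, doubling_period):
    """Growth factor after `months`, given a doubling period in months."""
    return 2 ** (months / doubling_period)

horizon = 72  # six years, an arbitrary illustrative horizon
data_growth = growth(horizon, 9)      # 2**8 = 256x more data
compute_growth = growth(horizon, 18)  # 2**4 = 16x more compute

# The shortfall that algorithms and software must make up:
shortfall = data_growth / compute_growth
print(data_growth, compute_growth, shortfall)  # 256.0 16.0 16.0
```

After six years the data has grown sixteen times more than the hardware, which is exactly the speaker's point about needing better algorithms and system software rather than just newer machines.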
348
00:20:26.930 --> 00:20:30.089 A:middle L:90%
We have to take a fresher, more innovative

349
00:20:30.089 --> 00:20:33.720 A:middle L:90%
look at how we can leverage

350
00:20:33.730 --> 00:20:38.640 A:middle L:90%
the hardware combined with system software, algorithms and applications

351
00:20:38.640 --> 00:20:42.460 A:middle L:90%
in a unique way in order to bridge that gap

352
00:20:44.640 --> 00:20:48.279 A:middle L:90%
. And we're gonna do this through parallel computing and

353
00:20:48.289 --> 00:20:51.450 A:middle L:90%
it's everywhere now. I mean, it's

354
00:20:51.460 --> 00:20:55.269 A:middle L:90%
almost impossible now to go out and buy a computing

355
00:20:55.269 --> 00:20:57.269 A:middle L:90%
device with a single core inside of it. Pretty

356
00:20:57.269 --> 00:21:00.089 A:middle L:90%
much every phone now is multicore. All smartphones

357
00:21:00.089 --> 00:21:03.970 A:middle L:90%
are multicore, laptops are all multicore, and so

358
00:21:03.970 --> 00:21:11.339 A:middle L:90%
on. And so what this amounts to is ubiquitous parallelism

359
00:21:11.339 --> 00:21:12.680 A:middle L:90%
. And the key is how to extract that performance

360
00:21:12.930 --> 00:21:18.190 A:middle L:90%
and scale out. Um so one way of doing

361
00:21:18.190 --> 00:21:22.369 A:middle L:90%
this is... well, before I get to that. So

362
00:21:22.380 --> 00:21:23.359 A:middle L:90%
what this amounts to, in short, is that our

363
00:21:23.359 --> 00:21:26.029 A:middle L:90%
free lunch is over. What we used to do is

364
00:21:26.029 --> 00:21:26.920 A:middle L:90%
we'd just go out to Best Buy and get

365
00:21:26.920 --> 00:21:30.819 A:middle L:90%
the next computer that has the next higher gigahertz clock

366
00:21:30.819 --> 00:21:33.799 A:middle L:90%
rating and we would get faster performance, but we

367
00:21:33.799 --> 00:21:36.039 A:middle L:90%
can't do that anymore. And now the burden falls

368
00:21:36.039 --> 00:21:37.920 A:middle L:90%
on you all to figure out how to exploit the

369
00:21:37.920 --> 00:21:42.660 A:middle L:90%
parallel hardware to get the appropriate performance or the necessary

370
00:21:42.660 --> 00:21:47.650 A:middle L:90%
performance. Um And so the question here is what

371
00:21:47.650 --> 00:21:49.500 A:middle L:90%
can we do as researchers in parallel computing to lower

372
00:21:49.500 --> 00:21:52.839 A:middle L:90%
this cost of concurrency. So one way of going

373
00:21:52.839 --> 00:21:56.960 A:middle L:90%
about doing this: this is the Berkeley View;

374
00:21:56.539 --> 00:21:59.910 A:middle L:90%
it started out at Lawrence Berkeley National Lab and it's

375
00:21:59.910 --> 00:22:03.859 A:middle L:90%
now part of the UC Berkeley landscape.

376
00:22:03.240 --> 00:22:07.329 A:middle L:90%
The idea is that applications target existing hardware and programming

377
00:22:07.329 --> 00:22:11.509 A:middle L:90%
models. But what we should do is we should

378
00:22:11.509 --> 00:22:15.750 A:middle L:90%
design hardware that keeps in mind future applications. So

379
00:22:15.750 --> 00:22:18.920 A:middle L:90%
the idea is rather than build the hardware and then

380
00:22:18.920 --> 00:22:22.920 A:middle L:90%
evaluate the hardware with various benchmarks and applications, we

381
00:22:22.920 --> 00:22:27.680 A:middle L:90%
instead think about what future applications there will be and

382
00:22:27.680 --> 00:22:33.250 A:middle L:90%
have that influence hardware design in the future. And

383
00:22:33.259 --> 00:22:34.869 A:middle L:90%
how do you go about doing that? So what

384
00:22:34.869 --> 00:22:41.400 A:middle L:90%
is this magic software? Well, what

385
00:22:41.410 --> 00:22:44.940 A:middle L:90%
Berkeley posited was: well, if you look at

386
00:22:44.940 --> 00:22:51.299 A:middle L:90%
least at scientific codes, there are 13 computational dwarfs

387
00:22:51.299 --> 00:22:52.029 A:middle L:90%
as they call them, where they view a computational

388
00:22:52.039 --> 00:22:56.630 A:middle L:90%
dwarf as a pattern of communication and computation that

389
00:22:56.630 --> 00:23:00.900 A:middle L:90%
is common across a set of applications. So those

390
00:23:00.900 --> 00:23:03.490 A:middle L:90%
of you that are taking, or hopefully have taken,

391
00:23:03.490 --> 00:23:07.480 A:middle L:90%
an algorithms class, the theory class: you think about

392
00:23:07.490 --> 00:23:11.329 A:middle L:90%
the various types of algorithms that you leverage in order

393
00:23:11.329 --> 00:23:15.890 A:middle L:90%
to achieve what your end goal is.

394
00:23:15.890 --> 00:23:19.569 A:middle L:90%
So you can think about n-body problems:

395
00:23:19.579 --> 00:23:23.519 A:middle L:90%
you can calculate the interactions between every pair of n

396
00:23:23.519 --> 00:23:29.799 A:middle L:90%
bodies within the universe in cosmology: the gravitational pull

397
00:23:29.799 --> 00:23:32.769 A:middle L:90%
between every pair of n bodies. You can then

398
00:23:32.769 --> 00:23:36.460 A:middle L:90%
take that same execution signature, shrink it down to

399
00:23:36.460 --> 00:23:40.910 A:middle L:90%
the molecule level and look at understanding the electrostatic surface

400
00:23:40.910 --> 00:23:45.339 A:middle L:90%
potential interactions between different pairs of atoms within a molecule

401
00:23:45.420 --> 00:23:49.269 A:middle L:90%
, to understand the proclivity of another molecule to bind

402
00:23:49.279 --> 00:23:55.089 A:middle L:90%
to a particular site. That's an example of

403
00:23:55.089 --> 00:23:57.900 A:middle L:90%
one computational dwarf, n-body, one that spans

404
00:23:57.900 --> 00:24:02.170 A:middle L:90%
from the large in outer space down to the very

405
00:24:02.170 --> 00:24:04.859 A:middle L:90%
tiny that you can't even see in atomic and molecular

406
00:24:04.859 --> 00:24:10.779 A:middle L:90%
space. So these are the ones that

407
00:24:10.779 --> 00:24:12.170 A:middle L:90%
came out in fact. So why did they get

408
00:24:12.170 --> 00:24:15.029 A:middle L:90%
started being called dwarfs? Turns out there

409
00:24:15.029 --> 00:24:18.420 A:middle L:90%
were seven original ones. And that's the reason why

410
00:24:18.430 --> 00:24:21.359 A:middle L:90%
they got dubbed as dwarfs, after Snow White and the

411
00:24:21.359 --> 00:24:23.849 A:middle L:90%
Seven Dwarfs, for those that aren't familiar with the literature

412
00:24:25.339 --> 00:24:27.269 A:middle L:90%
. And six additional ones got added since then.

413
00:24:27.839 --> 00:24:30.890 A:middle L:90%
The purpose of what I'm going to talk about is

414
00:24:30.890 --> 00:24:32.849 A:middle L:90%
not to go through each one of

415
00:24:32.849 --> 00:24:33.400 A:middle L:90%
these in great detail, but just to give you

416
00:24:33.400 --> 00:24:37.210 A:middle L:90%
an idea that what we're looking to do in

417
00:24:37.210 --> 00:24:41.170 A:middle L:90%
creating our software ecosystem is to use these

418
00:24:41.170 --> 00:24:44.349 A:middle L:90%
dwarfs as a guide to help us

419
00:24:44.819 --> 00:24:48.859 A:middle L:90%
personalize supercomputing for the masses through this notion of

420
00:24:49.240 --> 00:24:55.069 A:middle L:90%
heterogeneity of hardware and software that automatically tunes

421
00:24:55.069 --> 00:24:59.299 A:middle L:90%
with respect to performance,

422
00:24:59.299 --> 00:25:00.670 A:middle L:90%
power, and programmability. And we're going to do

423
00:25:00.670 --> 00:25:03.890 A:middle L:90%
it with this benchmark suite: the

424
00:25:03.890 --> 00:25:07.599 A:middle L:90%
dwarfs that I just talked about, except our own version

425
00:25:07.599 --> 00:25:10.259 A:middle L:90%
of it, and I'll allude to it now and

426
00:25:10.259 --> 00:25:11.440 A:middle L:90%
also talk about it later; it's called OpenDwarfs.

427
00:25:11.450 --> 00:25:14.970 A:middle L:90%
So if you do a Google search on OpenDwarfs

428
00:25:14.980 --> 00:25:17.420 A:middle L:90%
with the 'fs' at the end

429
00:25:17.420 --> 00:25:18.789 A:middle L:90%
, instead of 'ves', you'll find an open-source

430
00:25:18.789 --> 00:25:22.549 A:middle L:90%
project that's available for people to use. Um So

431
00:25:22.549 --> 00:25:26.019 A:middle L:90%
we're gonna look to have a multidimensional understanding of how

432
00:25:26.019 --> 00:25:29.579 A:middle L:90%
to optimize these different aspects, for each one of

433
00:25:29.579 --> 00:25:32.119 A:middle L:90%
those aspects or some combination thereof. And we're doing

434
00:25:32.119 --> 00:25:33.799 A:middle L:90%
it from embedded space all the way to the data

435
00:25:33.799 --> 00:25:37.500 A:middle L:90%
center space. Um In particular, what we're seeking

436
00:25:37.500 --> 00:25:41.990 A:middle L:90%
to do is create an ecosystem, such as

437
00:25:41.990 --> 00:25:44.339 A:middle L:90%
the one that's noted in the lower box

438
00:25:44.339 --> 00:25:47.750 A:middle L:90%
here, the software ecosystem to support the myriad of

439
00:25:47.750 --> 00:25:52.700 A:middle L:90%
applications that we've been collaborating with. Um So most

440
00:25:52.700 --> 00:25:56.119 A:middle L:90%
recently we landed a very large, multi-million

441
00:25:56.119 --> 00:26:00.670 A:middle L:90%
dollar grant to support this work. We're doing

442
00:26:00.680 --> 00:26:04.339 A:middle L:90%
avionics composites, and they're used in these micro-drones.

443
00:26:06.539 --> 00:26:08.180 A:middle L:90%
So the thing is that it seems kind of silly

444
00:26:08.180 --> 00:26:11.509 A:middle L:90%
for us to stovepipe a solution for each and every

445
00:26:11.509 --> 00:26:15.849 A:middle L:90%
one of these applications, and so we should create

446
00:26:15.859 --> 00:26:17.990 A:middle L:90%
abstractions. And this is what computer

447
00:26:17.990 --> 00:26:21.740 A:middle L:90%
science is all about: creating abstractions and powerful

448
00:26:21.740 --> 00:26:23.259 A:middle L:90%
enough tools that will be able to support all of

449
00:26:23.259 --> 00:26:27.769 A:middle L:90%
these different types of applications. And so

450
00:26:30.039 --> 00:26:30.759 A:middle L:90%
what I'm going to give you is a very, very

451
00:26:30.769 --> 00:26:37.710 A:middle L:90%
fast breeze-through of this ecosystem, shown here

452
00:26:37.710 --> 00:26:40.200 A:middle L:90%
in this yellow line. But what I want to

453
00:26:40.200 --> 00:26:41.829 A:middle L:90%
point out is that we're gonna talk about the dwarfs

454
00:26:41.829 --> 00:26:42.480 A:middle L:90%
and we're gonna use that as a way to not

455
00:26:42.480 --> 00:26:45.039 A:middle L:90%
just guide hardware design, which we have been doing

456
00:26:45.049 --> 00:26:49.109 A:middle L:90%
indirectly through NVIDIA, AMD, and

457
00:26:49.119 --> 00:26:52.940 A:middle L:90%
Intel; they have shown interest in specific dwarfs, and

458
00:26:52.950 --> 00:26:56.390 A:middle L:90%
how that's going to then affect their

459
00:26:56.390 --> 00:27:00.759 A:middle L:90%
architecture, we'll have to see. Um,

460
00:27:00.769 --> 00:27:03.670 A:middle L:90%
we have a source-to-source translation and optimization framework

461
00:27:03.039 --> 00:27:07.049 A:middle L:90%
. So all of those applications up there, they

462
00:27:07.059 --> 00:27:11.660 A:middle L:90%
write their parallel codes in different languages. So,

463
00:27:12.119 --> 00:27:17.059 A:middle L:90%
and there are times that we need to bridge the

464
00:27:17.059 --> 00:27:19.019 A:middle L:90%
gap between the languages that they're programming in and the

465
00:27:19.019 --> 00:27:25.299 A:middle L:90%
languages that underlie those languages or support

466
00:27:25.299 --> 00:27:29.549 A:middle L:90%
those languages for heterogeneous parallel computing. So it's like

467
00:27:29.940 --> 00:27:32.230 A:middle L:90%
, yeah, it's like if you have somebody

468
00:27:32.230 --> 00:27:34.430 A:middle L:90%
that speaks English and someone that speaks Swahili in the

469
00:27:34.430 --> 00:27:37.109 A:middle L:90%
same room; you have to find that bridge,

470
00:27:37.109 --> 00:27:38.779 A:middle L:90%
you need a translator to bridge those two. And

471
00:27:38.779 --> 00:27:41.900 A:middle L:90%
that's what this translator is seeking to do, seeking

472
00:27:41.900 --> 00:27:45.710 A:middle L:90%
to bridge what the applications are doing with what we're

473
00:27:45.710 --> 00:27:49.950 A:middle L:90%
doing software-wise with heterogeneous computing, or in this particular

474
00:27:49.950 --> 00:27:53.009 A:middle L:90%
case, uh, graphics processing units. So we

475
00:27:53.009 --> 00:27:56.460 A:middle L:90%
do the source-to-source translation. Now, we

476
00:27:56.470 --> 00:27:57.440 A:middle L:90%
are not a universal translator. We'd love to be

477
00:27:57.440 --> 00:28:00.190 A:middle L:90%
a universal translator, but that's a non-trivial

478
00:28:00.190 --> 00:28:03.069 A:middle L:90%
task. But we carve out the piece

479
00:28:03.069 --> 00:28:04.079 A:middle L:90%
that we're able to try and do and we have

480
00:28:04.079 --> 00:28:07.720 A:middle L:90%
this translator and then what we try to do is

481
00:28:07.730 --> 00:28:11.779 A:middle L:90%
we're trying to optimize this for the user automatically.

482
00:28:11.789 --> 00:28:15.299 A:middle L:90%
Okay. Um, we haven't been able

483
00:28:15.299 --> 00:28:15.930 A:middle L:90%
to do it automatically yet. This is more of

484
00:28:15.930 --> 00:28:18.849 A:middle L:90%
a cartoon or a vision of what we see going

485
00:28:18.859 --> 00:28:22.220 A:middle L:90%
forward. What we're doing right now is these architectural optimizations

486
00:28:22.230 --> 00:28:26.579 A:middle L:90%
. Um I had a student last year

487
00:28:26.589 --> 00:28:30.400 A:middle L:90%
who finished his Master's degree and I basically made him

488
00:28:30.400 --> 00:28:33.529 A:middle L:90%
a human compiler. So there's a list of optimizations

489
00:28:33.529 --> 00:28:36.140 A:middle L:90%
that he had to go through. And I said

490
00:28:36.140 --> 00:28:37.950 A:middle L:90%
, all right, I don't really, I'm not

491
00:28:37.950 --> 00:28:40.430 A:middle L:90%
a compiler expert, so I don't know how to

492
00:28:40.430 --> 00:28:41.710 A:middle L:90%
formalize this and automate this right now. But what we're

493
00:28:41.710 --> 00:28:44.059 A:middle L:90%
going to do is: you're going to be

494
00:28:44.059 --> 00:28:45.569 A:middle L:90%
a human compiler. You're gonna go through all the

495
00:28:45.569 --> 00:28:48.109 A:middle L:90%
different permutations of optimizations that will allow us to improve

496
00:28:48.109 --> 00:28:52.660 A:middle L:90%
the performance of the code. Okay. So um

497
00:28:53.240 --> 00:28:56.410 A:middle L:90%
so he went and did that. And so we

498
00:28:56.420 --> 00:29:00.519 A:middle L:90%
ended up, first, with initial parallelization. Just source-to-

499
00:29:00.519 --> 00:29:03.359 A:middle L:90%
source translation would get us an 88-fold speedup

500
00:29:03.359 --> 00:29:04.839 A:middle L:90%
on a GPU. And then once he applied all

501
00:29:04.839 --> 00:29:10.210 A:middle L:90%
his architecture-aware optimizations, which took many

502
00:29:10.210 --> 00:29:14.950 A:middle L:90%
, many weeks. He got another 4.2-fold

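Those two numbers compound multiplicatively, which is where the aggregate figure quoted next comes from:

```python
# Speedups from independent stages compound multiplicatively.
translation_speedup = 88.0   # initial source-to-source parallelization
tuning_speedup = 4.2         # architecture-aware hand optimizations

aggregate = translation_speedup * tuning_speedup
print(round(aggregate, 1))  # 369.6, i.e., roughly the ~370-fold figure quoted
```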
503
00:29:15.539 --> 00:29:18.029 A:middle L:90%
. So the aggregate performance improvement ended up being

504
00:29:18.029 --> 00:29:25.690 A:middle L:90%
about 371 or 372-fold over serial, a serial vectorized

505
00:29:25.690 --> 00:29:29.220 A:middle L:90%
execution on an Intel processor. Um A lot of

506
00:29:29.220 --> 00:29:32.740 A:middle L:90%
these optimizations we're trying to formalize into a model

507
00:29:32.740 --> 00:29:34.450 A:middle L:90%
so we can decide based on the model if we're

508
00:29:34.450 --> 00:29:38.569 A:middle L:90%
optimizing for performance or power or energy efficiency or both

509
00:29:38.579 --> 00:29:41.410 A:middle L:90%
performance and power at the same time. We

510
00:29:41.410 --> 00:29:47.450 A:middle L:90%
have a model that then guides where we map the

511
00:29:47.940 --> 00:29:49.930 A:middle L:90%
the tasks onto the processors. That is, is the

512
00:29:49.930 --> 00:29:53.710 A:middle L:90%
CPU better for this task or the gpu in your

513
00:29:53.710 --> 00:29:56.259 A:middle L:90%
own brain, it automatically figures out

514
00:29:56.259 --> 00:29:59.740 A:middle L:90%
whether you're using your left brain or right brain, but

515
00:29:59.750 --> 00:30:03.000 A:middle L:90%
we don't have that luxury in this kind of system

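The kind of model-guided task-to-device mapping being described can be sketched in miniature; every device name and cost number below is an invented placeholder (the real system profiles and models actual hardware):

```python
# Toy model-guided scheduler: pick the device with the lowest predicted cost
# for a task. All costs here are illustrative placeholders, not measurements.

def pick_device(task, cost_model):
    """Return the device whose predicted cost for `task` is lowest."""
    return min(cost_model, key=lambda dev: cost_model[dev](task))

# Hypothetical cost shapes: the GPU pays a fixed offload overhead but is
# cheap per element, so it only wins once the task is large enough.
cost_model = {
    "cpu": lambda t: t["size"] * 1.0,
    "gpu": lambda t: 5000 + t["size"] * 0.05,
}

print(pick_device({"size": 100}, cost_model))        # cpu
print(pick_device({"size": 1_000_000}, cost_model))  # gpu
```

The crossover point is exactly the sort of thing the profiling and modeling mentioned next are meant to learn per platform.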
516
00:30:03.009 --> 00:30:03.839 A:middle L:90%
. We actually have to figure out which one is

517
00:30:03.839 --> 00:30:07.390 A:middle L:90%
better suited for a particular task at hand. That involves a

518
00:30:07.390 --> 00:30:11.660 A:middle L:90%
lot of modeling and some profiling to understand how different

519
00:30:11.670 --> 00:30:15.559 A:middle L:90%
things map to the underlying architecture. It demands an understanding of

520
00:30:15.569 --> 00:30:19.829 A:middle L:90%
how data structures are represented and how the algorithms map

521
00:30:19.829 --> 00:30:25.319 A:middle L:90%
underneath, in order to get good performance. Um In

522
00:30:25.319 --> 00:30:27.160 A:middle L:90%
the last box is this task scheduling system, the

523
00:30:27.160 --> 00:30:30.299 A:middle L:90%
idea here is that a lot of the things

524
00:30:30.299 --> 00:30:33.869 A:middle L:90%
that you think you would like to parallelize at

525
00:30:33.869 --> 00:30:37.400 A:middle L:90%
design and compile time cannot be fully uncovered. All

526
00:30:37.400 --> 00:30:38.920 A:middle L:90%
the parallelism cannot be fully uncovered because there's all sorts

527
00:30:38.920 --> 00:30:42.099 A:middle L:90%
of data dependencies and control paths that you're not sure

528
00:30:42.099 --> 00:30:45.190 A:middle L:90%
about whether or not you're going down, and

529
00:30:45.190 --> 00:30:48.539 A:middle L:90%
you won't be able to expose that parallelism until runtime

530
00:30:48.549 --> 00:30:49.799 A:middle L:90%
. And so we have a task scheduling system here

531
00:30:49.809 --> 00:30:55.789 A:middle L:90%
that will, on the fly, in real time, decide

532
00:30:55.799 --> 00:31:00.779 A:middle L:90%
what mapping to make between the tasks and the

533
00:31:00.779 --> 00:31:07.210 A:middle L:90%
CPUs and GPUs and other heterogeneous computing devices

534
00:31:07.210 --> 00:31:08.819 A:middle L:90%
in the environment. I'm not going to talk about

535
00:31:08.819 --> 00:31:11.619 A:middle L:90%
this last bullet here but this gives you a high

536
00:31:11.619 --> 00:31:15.170 A:middle L:90%
level view. Any questions on this high-level view?

537
00:31:15.640 --> 00:31:18.960 A:middle L:90%
I'm gonna keep talking here a little bit

538
00:31:18.970 --> 00:31:22.240 A:middle L:90%
to stall and give you a chance to think about

539
00:31:22.250 --> 00:31:22.519 A:middle L:90%
if you have any questions. But this

540
00:31:22.519 --> 00:31:23.980 A:middle L:90%
is a high-level view, and what I'm gonna

541
00:31:23.980 --> 00:31:26.650 A:middle L:90%
do is give you like two to three brief slides

542
00:31:26.650 --> 00:31:30.809 A:middle L:90%
on each of the boxes to give you an understanding

543
00:31:30.930 --> 00:31:33.359 A:middle L:90%
of what it is that we're doing in each space

544
00:31:34.940 --> 00:31:37.430 A:middle L:90%
. And what I will say is that a lot

545
00:31:37.430 --> 00:31:41.559 A:middle L:90%
of this work is all related right now,

546
00:31:41.559 --> 00:31:45.079 A:middle L:90%
but they're not connected the way they're connected here.

547
00:31:45.150 --> 00:31:47.420 A:middle L:90%
Right? This is what we would love

548
00:31:47.420 --> 00:31:48.660 A:middle L:90%
to get to in the next 3 to 5 years

549
00:31:48.670 --> 00:31:52.339 A:middle L:90%
. Okay. And that's part of the

550
00:31:52.339 --> 00:31:55.900 A:middle L:90%
latest large grant that we've gotten, but we

551
00:31:55.900 --> 00:32:02.670 A:middle L:90%
have pieces of this ecosystem put together. Okay,

552
00:32:04.940 --> 00:32:07.680 A:middle L:90%
so I'll talk very briefly about the dwarfs. So

553
00:32:07.680 --> 00:32:09.759 A:middle L:90%
this is, I already gave you this example, an

554
00:32:09.759 --> 00:32:13.509 A:middle L:90%
example of a computational dwarf, n-body. N-body problems

555
00:32:13.509 --> 00:32:15.980 A:middle L:90%
are studied in cosmology, particle physics, biology, engineering.

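As a minimal sketch of the n-body dwarf's computational pattern, all-pairs interaction with accumulation, here is a toy O(n^2) loop (a 1-D, unit-mass, G = 1 illustration; real codes parallelize exactly this structure):

```python
# Toy all-pairs (n-body) pattern: for every body, accumulate the influence
# of every other body. Unit masses, G = 1, 1-D positions, for simplicity.

def all_pairs_forces(positions):
    n = len(positions)
    forces = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                r = positions[j] - positions[i]
                # inverse-square attraction; the sign of r gives the direction
                forces[i] += r / abs(r) ** 3
    return forces

f = all_pairs_forces([0.0, 1.0, 2.0])
print(f)  # [1.25, 0.0, -1.25]: outer bodies pulled inward, middle balanced
```

The same doubly nested accumulation shows up whether the "bodies" are galaxies or atoms, which is why one set of optimizations carries across both domains.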
556
00:32:15.990 --> 00:32:20.460 A:middle L:90%
So we did one with the cosmology. Uh This

557
00:32:20.460 --> 00:32:22.009 A:middle L:90%
is outer space and then one with molecular modeling,

558
00:32:22.009 --> 00:32:23.809 A:middle L:90%
this is work with Professor Alexey Onufriev,

559
00:32:23.809 --> 00:32:29.950 A:middle L:90%
who's a professor in computer science here. Um

560
00:32:30.339 --> 00:32:34.150 A:middle L:90%
They all have similar structures and this benchmark can provide

561
00:32:34.150 --> 00:32:37.599 A:middle L:90%
meaningful insight to people across these fields, and the

562
00:32:37.599 --> 00:32:42.480 A:middle L:90%
optimizations that we apply in this particular realm also will

563
00:32:42.490 --> 00:32:44.829 A:middle L:90%
apply in this realm, even though they seem like

564
00:32:44.829 --> 00:32:47.630 A:middle L:90%
they're very different. Fundamentally, from an algorithmic standpoint, they

565
00:32:47.630 --> 00:32:51.349 A:middle L:90%
share a number of similarities that allow us to apply

566
00:32:51.539 --> 00:32:54.950 A:middle L:90%
the same optimizations to both spaces. So the first

567
00:32:54.950 --> 00:32:59.849 A:middle L:90%
instantiation of our dwarfs was

568
00:32:59.849 --> 00:33:01.259 A:middle L:90%
originally called OpenCL and the 13 Dwarfs; since changed, it

569
00:33:01.259 --> 00:33:05.640 A:middle L:90%
is called OpenDwarfs. Um So we provide common

570
00:33:05.640 --> 00:33:08.980 A:middle L:90%
algorithmic methods, that is, dwarfs, in a language that's write

571
00:33:08.990 --> 00:33:12.480 A:middle L:90%
once run anywhere. Okay. And I should say

572
00:33:12.490 --> 00:33:15.190 A:middle L:90%
write once, run anywhere in the C notion of the

573
00:33:15.190 --> 00:33:16.059 A:middle L:90%
word, not in the Java notion of the word.

574
00:33:16.440 --> 00:33:19.430 A:middle L:90%
Okay. So the idea is that if you have

575
00:33:19.430 --> 00:33:24.480 A:middle L:90%
a C tool environment on your target host

576
00:33:24.490 --> 00:33:30.700 A:middle L:90%
, you'll be able to recompile the code

577
00:33:30.700 --> 00:33:35.980 A:middle L:90%
and run it on that new target platform. And

578
00:33:35.980 --> 00:33:37.740 A:middle L:90%
this is part of a larger umbrella project um for

579
00:33:37.740 --> 00:33:40.279 A:middle L:90%
the NSF Center for High-Performance Reconfigurable Computing.

580
00:33:40.279 --> 00:33:45.029 A:middle L:90%
This is a joint effort by Computer Science and

581
00:33:45.029 --> 00:33:47.759 A:middle L:90%
Electrical and Computer Engineering, um and we are being

582
00:33:47.759 --> 00:33:55.309 A:middle L:90%
renewed for another five years by NSF. Um This

583
00:33:55.309 --> 00:34:00.279 A:middle L:90%
is where we're at right now. Um These are

584
00:34:00.279 --> 00:34:00.779 A:middle L:90%
the ones that are done, there are a few

585
00:34:00.779 --> 00:34:04.319 A:middle L:90%
that are still in progress that we're trying to complete

586
00:34:04.329 --> 00:34:07.559 A:middle L:90%
the population, so that people have a full set

587
00:34:07.559 --> 00:34:09.190 A:middle L:90%
of dwarfs to look at. What I think has been

588
00:34:09.190 --> 00:34:13.530 A:middle L:90%
interesting is this: I'm gonna open up a

589
00:34:13.530 --> 00:34:19.309 A:middle L:90%
Pandora's box, because these applications are just written by

590
00:34:19.309 --> 00:34:22.710 A:middle L:90%
hand. They're non-optimized; there's not a

591
00:34:22.710 --> 00:34:25.769 A:middle L:90%
target architecture in mind. Uh they're just generic algorithms

592
00:34:25.780 --> 00:34:28.880 A:middle L:90%
and we decided, well let's just take these and

593
00:34:28.880 --> 00:34:32.320 A:middle L:90%
see if we could use this to run

594
00:34:32.320 --> 00:34:36.250 A:middle L:90%
on other platforms. And the reason... oh, let

595
00:34:36.250 --> 00:34:37.130 A:middle L:90%
me back up one minute. So the reason why

596
00:34:37.130 --> 00:34:40.130 A:middle L:90%
OpenCL is the Open Computing Language is because

597
00:34:40.139 --> 00:34:44.090 A:middle L:90%
if you write your code in OpenCL, you'll

598
00:34:44.090 --> 00:34:46.550 A:middle L:90%
be able to run it on any infrastructure that

599
00:34:46.550 --> 00:34:50.219 A:middle L:90%
supports OpenCL, just like any infrastructure that supports

600
00:34:50.219 --> 00:34:52.070 A:middle L:90%
C. But you can run it right now on

601
00:34:52.079 --> 00:34:58.199 A:middle L:90%
CPUs, GPUs, FPGAs, APUs

602
00:34:58.199 --> 00:35:00.949 A:middle L:90%
, which are accelerated processing units, you don't have

603
00:35:00.949 --> 00:35:04.969 A:middle L:90%
to rewrite your code across all of these. As little

604
00:35:04.969 --> 00:35:06.900 A:middle L:90%
as two years ago, you would be writing

605
00:35:06.900 --> 00:35:09.190 A:middle L:90%
Pthreads code for your CPU, and then you

606
00:35:09.190 --> 00:35:13.409 A:middle L:90%
would have to port what parts you want to accelerate

607
00:35:13.599 --> 00:35:16.079 A:middle L:90%
and write them in something called CUDA, and you run

608
00:35:16.079 --> 00:35:17.969 A:middle L:90%
that on your GPUs. Oh, and then if

609
00:35:17.969 --> 00:35:20.449 A:middle L:90%
you want to run on FPGAs

610
00:35:20.460 --> 00:35:22.159 A:middle L:90%
, you have to translate that and write it

611
00:35:22.170 --> 00:35:24.949 A:middle L:90%
in VHDL or Verilog.

612
00:35:27.340 --> 00:35:29.869 A:middle L:90%
So the portability aspect of being able to run on any

613
00:35:29.869 --> 00:35:31.619 A:middle L:90%
parallel computing device just gets to be a pain.

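The "write once, run anywhere" idea can be sketched in miniature: a single per-element kernel definition executed by interchangeable backends. OpenCL does this for real across CPUs, GPUs, FPGAs, and APUs via vendor runtimes; the two serial "backends" and all names below are purely illustrative:

```python
# One kernel definition, many executors: a toy analogue of the OpenCL model,
# where the same kernel source runs on whatever device the runtime targets.

def saxpy_kernel(i, a, x, y):
    """Per-element kernel, written once (mirrors an OpenCL work-item body)."""
    return a * x[i] + y[i]

def run_serial(kernel, n, *args):
    # Baseline executor: one work-item at a time.
    return [kernel(i, *args) for i in range(n)]

def run_chunked(kernel, n, *args, chunk=2):
    # Stand-in for a parallel backend: same kernel, different execution strategy.
    out = []
    for start in range(0, n, chunk):
        out.extend(kernel(i, *args) for i in range(start, min(start + chunk, n)))
    return out

x, y = [1.0, 2.0, 3.0], [10.0, 10.0, 10.0]
print(run_serial(saxpy_kernel, 3, 2.0, x, y))  # [12.0, 14.0, 16.0]
```

The point is that the application-level kernel never changes when the execution strategy does, which is exactly what rewriting across Pthreads, CUDA, and VHDL fails to give you.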
614
00:35:31.750 --> 00:35:34.889 A:middle L:90%
And so it turns out that Apple was the one

615
00:35:34.889 --> 00:35:38.179 A:middle L:90%
that came up with this OpenCL language, because they

616
00:35:38.179 --> 00:35:44.170 A:middle L:90%
were leveraging the graphics processing unit to run system

617
00:35:45.940 --> 00:35:49.760 A:middle L:90%
tasks. So they came up with OpenCL, using

618
00:35:49.760 --> 00:35:52.710 A:middle L:90%
the GPU to do some operating system tasks. And so

619
00:35:52.719 --> 00:35:55.630 A:middle L:90%
when they had an NVIDIA GPU, they were programming

620
00:35:55.630 --> 00:36:00.079 A:middle L:90%
in CUDA. Then, after the three-

621
00:36:00.079 --> 00:36:01.030 A:middle L:90%
year contract was over, they said, oh wait

622
00:36:01.039 --> 00:36:02.320 A:middle L:90%
, you know, we want to use an AMD

623
00:36:02.320 --> 00:36:06.440 A:middle L:90%
GPU because it's cheaper, because with NVIDIA, it feels

624
00:36:06.440 --> 00:36:08.480 A:middle L:90%
like they have us over a barrel. Um and

625
00:36:08.480 --> 00:36:09.929 A:middle L:90%
so we're gonna switch to AMD. Well, then they

626
00:36:09.929 --> 00:36:13.960 A:middle L:90%
have to switch to the Brook GPU programming language. So

627
00:36:14.429 --> 00:36:16.219 A:middle L:90%
rather than have to deal with this overhead of having

628
00:36:16.219 --> 00:36:21.010 A:middle L:90%
to continually translate back and forth, they created

629
00:36:21.010 --> 00:36:22.559 A:middle L:90%
OpenCL as a common language that would just run on

630
00:36:22.559 --> 00:36:25.429 A:middle L:90%
CPUs, GPUs, and what have you. So you can

631
00:36:25.429 --> 00:36:29.730 A:middle L:90%
imagine the MacBook Airs: if there's not an extra GPU

632
00:36:29.730 --> 00:36:31.320 A:middle L:90%
to run the operating system on, it will just

633
00:36:31.329 --> 00:36:35.079 A:middle L:90%
fall back and run on the CPU; it will

634
00:36:35.079 --> 00:36:37.219 A:middle L:90%
run on any parallel computing device that you have underneath

635
00:36:37.219 --> 00:36:39.179 A:middle L:90%
it. So we did this and we just tried

636
00:36:39.179 --> 00:36:42.309 A:middle L:90%
to do this and this is just a really small

637
00:36:42.309 --> 00:36:45.659 A:middle L:90%
digest version. We tried it out; we used

638
00:36:45.659 --> 00:36:49.000 A:middle L:90%
the OpenCL software development kits from

639
00:36:49.000 --> 00:36:50.710 A:middle L:90%
AMD and from Intel, and we ran

640
00:36:50.710 --> 00:36:52.159 A:middle L:90%
it on Intel processors. This is just a straightforward

641
00:36:52.159 --> 00:36:55.380 A:middle L:90%
benchmarking exercise. And what's pretty interesting

642
00:36:55.380 --> 00:36:59.159 A:middle L:90%
is the amount of time that it takes to do

643
00:36:59.170 --> 00:37:02.239 A:middle L:90%
each one of the different dwarfs. Um So,

644
00:37:02.250 --> 00:37:04.679 A:middle L:90%
you know, CFD is a

645
00:37:04.679 --> 00:37:07.030 A:middle L:90%
structured grid dwarf, and you see that the AMD

646
00:37:07.030 --> 00:37:10.559 A:middle L:90%
has much better performance; it completes in

647
00:37:10.570 --> 00:37:15.269 A:middle L:90%
less than two milliseconds, whereas the Intel one largely

648
00:37:15.269 --> 00:37:17.630 A:middle L:90%
completes in four milliseconds. That's

649
00:37:17.630 --> 00:37:21.150 A:middle L:90%
a quartile box between the 1st and 3rd quartile,

650
00:37:21.429 --> 00:37:22.989 A:middle L:90%
but there are some significant outliers all the way out at

651
00:37:22.989 --> 00:37:28.239 A:middle L:90%
12 milliseconds. What we ought to do

652
00:37:28.239 --> 00:37:29.570 A:middle L:90%
, but we haven't had time to do, is to really

653
00:37:29.570 --> 00:37:32.159 A:middle L:90%
try to understand why this is going on, but

654
00:37:32.170 --> 00:37:35.659 A:middle L:90%
we just really put this out there for people to

655
00:37:36.030 --> 00:37:38.730 A:middle L:90%
to catalyze further research in this area. We

656
00:37:38.730 --> 00:37:40.510 A:middle L:90%
don't have time, say, to do it ourselves,

657
00:37:40.510 --> 00:37:44.010 A:middle L:90%
although if you are interested, we

658
00:37:44.010 --> 00:37:45.539 A:middle L:90%
can get you hooked up and get going. Um

659
00:37:45.550 --> 00:37:49.170 A:middle L:90%
So, another interesting thing is that it depends on

660
00:37:49.170 --> 00:37:52.489 A:middle L:90%
the dwarf. You see over here, the Intel platform

661
00:37:52.500 --> 00:37:54.960 A:middle L:90%
for the FFT, the fast Fourier transform, which is

662
00:37:54.960 --> 00:38:00.030 A:middle L:90%
a spectral method, does quite well at 0.7 milliseconds,

663
00:38:00.489 --> 00:38:02.599 A:middle L:90%
and the AMD one does quite poorly. And this

664
00:38:02.599 --> 00:38:07.360 A:middle L:90%
is on the CPU; we're running OpenCL code on

665
00:38:07.360 --> 00:38:09.449 A:middle L:90%
the CPU in this case. We thought

666
00:38:09.449 --> 00:38:12.820 A:middle L:90%
for sure that the Intel

667
00:38:12.820 --> 00:38:15.500 A:middle L:90%
platform would win across the board because even though we

668
00:38:15.500 --> 00:38:19.610 A:middle L:90%
were using vendor-specific SDKs, AMD's

669
00:38:19.610 --> 00:38:21.019 A:middle L:90%
SDK and Intel's SDK, we were

670
00:38:21.019 --> 00:38:25.010 A:middle L:90%
running it on Intel hardware. So it turns out in

671
00:38:25.010 --> 00:38:28.449 A:middle L:90%
this case that it isn't always the case that the

672
00:38:28.449 --> 00:38:30.699 A:middle L:90%
Intel will run better, even though we had it rigged

673
00:38:30.769 --> 00:38:32.849 A:middle L:90%
in some sense to run better on Intel. All

674
00:38:32.849 --> 00:38:39.460 A:middle L:90%
right. So that's just a flavor. Uh I'll

675
00:38:39.460 --> 00:38:42.449 A:middle L:90%
give you another perspective, on power; I just

676
00:38:42.449 --> 00:38:45.010 A:middle L:90%
talked about performance. This is a power graph.

677
00:38:45.019 --> 00:38:45.489 A:middle L:90%
So this is a

678
00:38:45.489 --> 00:38:49.420 A:middle L:90%
CPU. You can see an FPGA, which

679
00:38:49.420 --> 00:38:52.500 A:middle L:90%
is a programmable processor. You can program the devices

680
00:38:52.500 --> 00:38:53.519 A:middle L:90%
on the FPGA to configure how you

681
00:38:53.519 --> 00:38:57.610 A:middle L:90%
want them. These are two high-powered GPUs,

682
00:38:57.619 --> 00:39:01.530 A:middle L:90%
and you can see here two low-powered GPUs

683
00:39:01.530 --> 00:39:04.730 A:middle L:90%
and CPUs. So this Mac mini has a CPU

684
00:39:04.730 --> 00:39:06.239 A:middle L:90%
and GPU. It's not on the same die,

685
00:39:06.239 --> 00:39:07.199 A:middle L:90%
but it's on the same package. You can see

686
00:39:07.199 --> 00:39:10.960 A:middle L:90%
how energy efficient that is. So, when I

687
00:39:10.960 --> 00:39:14.659 A:middle L:90%
talk about heterogeneous parallel computing, I really am talking

688
00:39:14.659 --> 00:39:17.510 A:middle L:90%
about it from the perspective of small mobile embedded spaces

689
00:39:17.510 --> 00:39:20.900 A:middle L:90%
, desktop spaces all the way up to high performance

690
00:39:20.900 --> 00:39:23.059 A:middle L:90%
computing spaces such as these. And this one is

691
00:39:23.139 --> 00:39:25.760 A:middle L:90%
hard to see up there. And if you want to

692
00:39:25.760 --> 00:39:30.090 A:middle L:90%
find out more, um we have a number of

693
00:39:30.099 --> 00:39:32.340 A:middle L:90%
publications on the different dwarfs, uh you just go

694
00:39:32.340 --> 00:39:35.780 A:middle L:90%
to synergy.cs.vt.edu

695
00:39:35.780 --> 00:39:37.489 A:middle L:90%
and just click on publications, and you can do

696
00:39:37.489 --> 00:39:40.389 A:middle L:90%
a search for that. Um, we have a

697
00:39:40.389 --> 00:39:45.690 A:middle L:90%
summary paper that was submitted and published this past

698
00:39:45.690 --> 00:39:49.869 A:middle L:90%
year. It's just a brief uh research note.

699
00:39:49.880 --> 00:39:52.300 A:middle L:90%
It's only four pages; it gets you oriented in

700
00:39:52.300 --> 00:39:55.139 A:middle L:90%
terms of what we're doing and then the publications that

701
00:39:55.139 --> 00:39:59.170 A:middle L:90%
are underneath it are deeper dives into a particular

702
00:39:59.179 --> 00:40:05.619 A:middle L:90%
dwarf. Well, okay, so I'm gonna move

703
00:40:05.619 --> 00:40:08.039 A:middle L:90%
on to the next box. Um but before I

704
00:40:08.039 --> 00:40:09.579 A:middle L:90%
do, I want to point out that this is

705
00:40:09.579 --> 00:40:13.280 A:middle L:90%
what we have of the dwarfs. And one of

706
00:40:13.280 --> 00:40:15.079 A:middle L:90%
the problems was that in order to generate all these

707
00:40:15.079 --> 00:40:20.469 A:middle L:90%
dwarfs, it took two years of students uh contributing

708
00:40:20.480 --> 00:40:23.559 A:middle L:90%
to this suite to develop. Now, it wasn't two

709
00:40:23.559 --> 00:40:25.239 A:middle L:90%
years of full-time student effort. I mean, you all

710
00:40:25.239 --> 00:40:28.619 A:middle L:90%
have taken classes, you go on internships and things

711
00:40:28.619 --> 00:40:30.539 A:middle L:90%
like that; this is just to give an idea

712
00:40:30.539 --> 00:40:31.289 A:middle L:90%
. It took us a while to do this and

713
00:40:31.289 --> 00:40:34.809 A:middle L:90%
on top of that, we noticed that

714
00:40:34.820 --> 00:40:37.260 A:middle L:90%
once you take the code

715
00:40:37.260 --> 00:40:39.449 A:middle L:90%
and try to optimize it with respect

716
00:40:39.449 --> 00:40:42.869 A:middle L:90%
to the underlying target architecture, we were able to

717
00:40:42.869 --> 00:40:45.150 A:middle L:90%
get an additional 4.2-fold speedup. So we went

718
00:40:45.150 --> 00:40:47.760 A:middle L:90%
from 88-fold to 371-fold. So what

719
00:40:47.760 --> 00:40:52.610 A:middle L:90%
can we do with respect to leveraging these dwarfs uh

720
00:40:52.619 --> 00:40:53.820 A:middle L:90%
to effect better performance? Well, we'd like to

721
00:40:53.820 --> 00:40:57.010 A:middle L:90%
get the time that it takes to develop these things

722
00:40:57.010 --> 00:41:00.409 A:middle L:90%
down and we'd also like to eliminate the need for

723
00:41:00.409 --> 00:41:02.099 A:middle L:90%
the end user to have to optimize and that's where

724
00:41:02.099 --> 00:41:07.739 A:middle L:90%
we get into this source-to-source translation and optimization

725
00:41:07.739 --> 00:41:10.900 A:middle L:90%
framework. Okay, so we're not going to try

726
00:41:10.900 --> 00:41:15.360 A:middle L:90%
and be the universal translator. That's not something that's

727
00:41:15.360 --> 00:41:16.849 A:middle L:90%
probably ever going to be done in my lifetime or

728
00:41:16.860 --> 00:41:21.409 A:middle L:90%
likely yours either. But we're looking

729
00:41:21.409 --> 00:41:25.110 A:middle L:90%
in particular at codes that have been CUDA-accelerated

730
00:41:25.119 --> 00:41:29.789 A:middle L:90%
on NVIDIA GPUs, and we want them to be translated

731
00:41:29.789 --> 00:41:32.539 A:middle L:90%
to a language that will run anywhere and that would

732
00:41:32.539 --> 00:41:35.960 A:middle L:90%
be OpenCL. And so if you take a look

733
00:41:35.960 --> 00:41:38.130 A:middle L:90%
at the prevalence of code, you see CUDA

734
00:41:38.130 --> 00:41:42.949 A:middle L:90%
code, um, I should update these numbers but

735
00:41:42.949 --> 00:41:45.059 A:middle L:90%
you can see that CUDA source code has had

736
00:41:45.059 --> 00:41:46.840 A:middle L:90%
over a million search results. This was back a few

737
00:41:46.840 --> 00:41:50.739 A:middle L:90%
years ago, actually, and for OpenCL source code there

738
00:41:50.739 --> 00:41:53.380 A:middle L:90%
were significantly fewer. And if you went

739
00:41:53.380 --> 00:41:55.989 A:middle L:90%
to the NVIDIA website, you'd just see all these

740
00:41:55.989 --> 00:42:01.820 A:middle L:90%
applications that are being accelerated. Um and many of

741
00:42:01.820 --> 00:42:06.510 A:middle L:90%
them are captured by some aspects of the dwarfs that

742
00:42:06.510 --> 00:42:10.530 A:middle L:90%
I've been talking about. Um, so everything from

743
00:42:10.530 --> 00:42:14.719 A:middle L:90%
more scientific types of things like life sciences, um

744
00:42:15.099 --> 00:42:19.980 A:middle L:90%
to government and defense, to financial markets, to

745
00:42:19.980 --> 00:42:22.889 A:middle L:90%
oil and gas. So what we did was we

746
00:42:22.900 --> 00:42:28.530 A:middle L:90%
created something called CU2CL; it's a CUDA-to-OpenCL

747
00:42:28.530 --> 00:42:30.760 A:middle L:90%
source-to-source translator. It's implemented as a Clang

748
00:42:30.760 --> 00:42:35.460 A:middle L:90%
plugin. Clang is a production

749
00:42:35.460 --> 00:42:38.530 A:middle L:90%
quality compiler framework, and we leverage it heavily because

750
00:42:38.530 --> 00:42:42.489 A:middle L:90%
there's no sense in reinventing the wheel.

751
00:42:42.500 --> 00:42:46.769 A:middle L:90%
These people are compiler experts; we

752
00:42:46.769 --> 00:42:51.090 A:middle L:90%
know that what they've done has been tested, and

753
00:42:51.099 --> 00:42:55.139 A:middle L:90%
so we're relying on that framework and then we make

754
00:42:55.139 --> 00:43:00.170 A:middle L:90%
calls into this Clang framework to support the primary CUDA

755
00:43:00.170 --> 00:43:02.989 A:middle L:90%
constructs found in CUDA C and the CUDA runtime API.

756
00:43:04.269 --> 00:43:07.239 A:middle L:90%
And what we found is that the codes that were

757
00:43:07.239 --> 00:43:09.659 A:middle L:90%
manually ported or written, which you saw

758
00:43:09.659 --> 00:43:14.059 A:middle L:90%
on the previous slide, we could

759
00:43:14.059 --> 00:43:15.119 A:middle L:90%
take those and do

760
00:43:15.119 --> 00:43:19.179 A:middle L:90%
it automatically and get the same performance. So

761
00:43:19.190 --> 00:43:21.820 A:middle L:90%
the work that took two years for students to do

762
00:43:21.869 --> 00:43:25.750 A:middle L:90%
, we can do in seconds now. So the

763
00:43:25.760 --> 00:43:30.250 A:middle L:90%
students have been made obsolete. I'm just kidding.

764
00:43:30.250 --> 00:43:34.760 A:middle L:90%
In fact, there was a

765
00:43:34.760 --> 00:43:37.400 A:middle L:90%
very bright integrated BS/MS

766
00:43:37.400 --> 00:43:42.920 A:middle L:90%
student who worked on this project to realize it

767
00:43:42.929 --> 00:43:45.980 A:middle L:90%
as an automated translator. There have been many requests

768
00:43:45.980 --> 00:43:49.780 A:middle L:90%
for other translators, but there's only

769
00:43:49.789 --> 00:43:52.440 A:middle L:90%
so much time in a day. We

770
00:43:52.449 --> 00:43:53.809 A:middle L:90%
bit off the largest chunk, and that was CUDA to

771
00:43:53.809 --> 00:43:58.440 A:middle L:90%
OpenCL, but there have been movements toward getting OpenCL

772
00:43:58.440 --> 00:44:00.639 A:middle L:90%
to CUDA and OpenMP to OpenCL as

773
00:44:00.639 --> 00:44:05.320 A:middle L:90%
well. Um I'm not expecting everyone to understand this

774
00:44:05.320 --> 00:44:07.110 A:middle L:90%
picture. I'm just, this is just a high

775
00:44:07.110 --> 00:44:12.309 A:middle L:90%
level overview. We're really relying heavily

776
00:44:12.320 --> 00:44:15.559 A:middle L:90%
on this Clang framework. And so we're making

777
00:44:15.559 --> 00:44:17.889 A:middle L:90%
use of its traversal, identification, and rewriting libraries in

778
00:44:17.889 --> 00:44:22.599 A:middle L:90%
order to take this CUDA source code over here and

779
00:44:22.610 --> 00:44:28.340 A:middle L:90%
generate the appropriate OpenCL host-file code and

780
00:44:28.340 --> 00:44:30.800 A:middle L:90%
OpenCL kernel-file code. So the host code runs

781
00:44:30.800 --> 00:44:32.150 A:middle L:90%
on the CPU and the kernel-file code runs on

782
00:44:32.150 --> 00:44:43.050 A:middle L:90%
the GPU. Oops. So just

783
00:44:43.050 --> 00:44:45.579 A:middle L:90%
to give an idea of how this is done so

784
00:44:45.579 --> 00:44:47.329 A:middle L:90%
far, um it's done pretty well. So the

785
00:44:47.329 --> 00:44:50.420 A:middle L:90%
number of lines of code is not the total lines

786
00:44:50.420 --> 00:44:51.630 A:middle L:90%
of code; this is the number of CUDA

787
00:44:51.630 --> 00:44:53.829 A:middle L:90%
lines of code. So, for example,

788
00:44:53.840 --> 00:44:58.940 A:middle L:90%
one professor's molecular modeling code is about 7,500

789
00:44:58.940 --> 00:45:00.929 A:middle L:90%
lines of code, of which some 2,500 are for

790
00:45:00.929 --> 00:45:04.690 A:middle L:90%
the GPU. And we were able to translate all

791
00:45:04.690 --> 00:45:08.400 A:middle L:90%
but five of those automatically. So you can imagine

792
00:45:08.400 --> 00:45:10.920 A:middle L:90%
having to do this manually by hand versus oh wait

793
00:45:12.789 --> 00:45:14.269 A:middle L:90%
, it's all done. Other than five lines,

794
00:45:14.269 --> 00:45:15.420 A:middle L:90%
I have to go in and twiddle five lines and

795
00:45:15.429 --> 00:45:19.260 A:middle L:90%
make it run. We've since done this;

796
00:45:19.260 --> 00:45:22.289 A:middle L:90%
we're up to 100 applications now. It's not

797
00:45:22.289 --> 00:45:24.929 A:middle L:90%
perfect, but it helps you get a good part

798
00:45:24.929 --> 00:45:27.869 A:middle L:90%
of the way. We don't do any transformations to

799
00:45:27.869 --> 00:45:29.960 A:middle L:90%
the code. We're doing source-to-source. So

800
00:45:29.960 --> 00:45:32.170 A:middle L:90%
what you have in CUDA code still looks the same

801
00:45:32.179 --> 00:45:35.469 A:middle L:90%
in the OpenCL code. And so you

802
00:45:35.469 --> 00:45:37.889 A:middle L:90%
can then develop from the OpenCL code at that

803
00:45:37.900 --> 00:45:42.550 A:middle L:90%
point. For more information about that, you

804
00:45:42.550 --> 00:45:47.730 A:middle L:90%
can go to this publication on CUDA-to-OpenCL

805
00:45:47.730 --> 00:46:00.159 A:middle L:90%
source-to-source translation. Now, that takes care

806
00:46:00.159 --> 00:46:05.179 A:middle L:90%
of the 2009-to-2011 time that it took to

807
00:46:05.179 --> 00:46:08.210 A:middle L:90%
develop the dwarfs and compresses it to seconds. Okay

808
00:46:08.679 --> 00:46:12.840 A:middle L:90%
, um what about the performance aspect? We saw

809
00:46:12.840 --> 00:46:16.050 A:middle L:90%
that 88-fold to 371-fold improvement. Well,

810
00:46:16.050 --> 00:46:20.000 A:middle L:90%
that's where the architecture-aware optimizations came in, and

811
00:46:20.010 --> 00:46:22.559 A:middle L:90%
this part we have not automated. This is something

812
00:46:22.559 --> 00:46:24.530 A:middle L:90%
that we're looking to automate. But I wanted to

813
00:46:24.530 --> 00:46:29.420 A:middle L:90%
give you just a brief flavor of what it is

814
00:46:29.420 --> 00:46:30.989 A:middle L:90%
that we're doing on architecture optimizations. And so,

815
00:46:31.280 --> 00:46:35.199 A:middle L:90%
this is probably, hopefully, self-evident, but

816
00:46:35.210 --> 00:46:37.329 A:middle L:90%
the performance of different CPUs is not equivalent. Neither

817
00:46:37.329 --> 00:46:40.489 A:middle L:90%
is that of different GPUs. If you take a

818
00:46:40.489 --> 00:46:45.599 A:middle L:90%
look at the peak performance of an NVIDIA GPU versus

819
00:46:45.599 --> 00:46:47.900 A:middle L:90%
an AMD GPU for single-precision floating-point performance, you

820
00:46:47.900 --> 00:46:52.559 A:middle L:90%
see that the AMD GPU is theoretically supposed to be

821
00:46:52.559 --> 00:46:57.190 A:middle L:90%
twice as fast as the NVIDIA GPU.

822
00:46:57.679 --> 00:46:59.070 A:middle L:90%
When you come over here and you look at the

823
00:46:59.070 --> 00:47:00.570 A:middle L:90%
speedup with respect to hand-tuned SSE, which

824
00:47:00.570 --> 00:47:05.869 A:middle L:90%
essentially uses the vectorizing extensions

825
00:47:05.880 --> 00:47:07.670 A:middle L:90%
of a serial processor. You see that we got a

826
00:47:07.670 --> 00:47:10.670 A:middle L:90%
328-fold speedup on the GTX 280. When we

827
00:47:10.670 --> 00:47:16.530 A:middle L:90%
moved it to a newer, allegedly faster GPU card

828
00:47:17.280 --> 00:47:22.309 A:middle L:90%
, we took a performance hit. And the reason

829
00:47:22.309 --> 00:47:27.420 A:middle L:90%
is that the optimizations that were valid for this older

830
00:47:27.420 --> 00:47:30.769 A:middle L:90%
NVIDIA architecture don't necessarily apply anymore to the newer architecture

831
00:47:30.869 --> 00:47:34.369 A:middle L:90%
. They applied well enough that you still get about

832
00:47:34.369 --> 00:47:36.659 A:middle L:90%
the same performance, but they don't necessarily apply enough

833
00:47:36.670 --> 00:47:38.059 A:middle L:90%
to give better performance. And it was even worse

834
00:47:38.059 --> 00:47:43.210 A:middle L:90%
with the AMD GPU. Okay. It turns

835
00:47:43.210 --> 00:47:45.050 A:middle L:90%
out that some of the optimizations

836
00:47:45.050 --> 00:47:47.949 A:middle L:90%
did apply; you got some speedup, but you

837
00:47:47.949 --> 00:47:51.639 A:middle L:90%
didn't get as much speedup as you would expect,

838
00:47:51.639 --> 00:47:54.300 A:middle L:90%
because the peak performance is supposed to be twice

839
00:47:54.300 --> 00:47:58.030 A:middle L:90%
as good as the NVIDIA. So if we were

840
00:47:58.030 --> 00:48:00.000 A:middle L:90%
to extrapolate here, you would expect 328 times two

841
00:48:00.010 --> 00:48:04.269 A:middle L:90%
. You would want it to be 656

842
00:48:04.269 --> 00:48:07.000 A:middle L:90%
times faster, but it's only 224. Okay.

843
00:48:08.269 --> 00:48:13.949 A:middle L:90%
So I'm just gonna skip that one.

844
00:48:13.949 --> 00:48:15.429 A:middle L:90%
Let's go here. So what we ended

845
00:48:15.429 --> 00:48:19.869 A:middle L:90%
up doing is this: manually, my human compiler,

846
00:48:19.880 --> 00:48:22.920 A:middle L:90%
a student, and I worked and found

847
00:48:22.920 --> 00:48:27.570 A:middle L:90%
a bunch of different manual optimizations with respect to

848
00:48:27.570 --> 00:48:29.690 A:middle L:90%
AMD GPUs. And I should say that this is

849
00:48:29.690 --> 00:48:30.789 A:middle L:90%
for the old generation, because with the new generation they

850
00:48:30.789 --> 00:48:34.829 A:middle L:90%
completely yanked the rug out from under us and

851
00:48:34.840 --> 00:48:39.949 A:middle L:90%
changed the underlying architecture completely, but you don't have

852
00:48:39.949 --> 00:48:45.500 A:middle L:90%
to understand all of the different uh optimizations here.

853
00:48:45.869 --> 00:48:47.849 A:middle L:90%
All I'm going to point out is that

854
00:48:47.860 --> 00:48:52.210 A:middle L:90%
we applied a lot of these in isolation as well

855
00:48:52.210 --> 00:48:54.250 A:middle L:90%
as in combination. And what you see is that

856
00:48:54.250 --> 00:49:00.820 A:middle L:90%
we see different speedups entailed by the different compiler

857
00:49:00.820 --> 00:49:05.769 A:middle L:90%
optimizations. Maybe I'll just make one

858
00:49:05.769 --> 00:49:07.809 A:middle L:90%
point with an example: loop unrolling,

859
00:49:07.809 --> 00:49:08.679 A:middle L:90%
two-way or four-way. So what that does

860
00:49:08.679 --> 00:49:12.630 A:middle L:90%
is, you have an iterative loop and you execute it 100

861
00:49:12.630 --> 00:49:14.860 A:middle L:90%
times. Well, one way you can do it

862
00:49:14.869 --> 00:49:16.349 A:middle L:90%
is, to avoid the overhead of

863
00:49:16.349 --> 00:49:19.710 A:middle L:90%
the branching back to the top, you unroll

864
00:49:19.710 --> 00:49:22.900 A:middle L:90%
it a second time, so you take that loop

865
00:49:22.909 --> 00:49:24.539 A:middle L:90%
and you replicate it. Now, instead of doing that

866
00:49:24.539 --> 00:49:28.739 A:middle L:90%
single loop body 100 times, you're doing two copies of

867
00:49:28.739 --> 00:49:32.840 A:middle L:90%
the loop body 50 times, and you've eliminated half the branch points.

868
00:49:32.960 --> 00:49:35.800 A:middle L:90%
Now the trade-off there, of course, is that

869
00:49:35.809 --> 00:49:37.190 A:middle L:90%
now your code is larger, it's going to take

870
00:49:37.190 --> 00:49:40.170 A:middle L:90%
up more space. It may then push out data

871
00:49:40.170 --> 00:49:44.019 A:middle L:90%
or other code that you need to execute and push

872
00:49:44.019 --> 00:49:45.820 A:middle L:90%
it out to disk, which then impacts your performance

873
00:49:45.840 --> 00:49:50.219 A:middle L:90%
with respect to the code because you're swapping from memory

874
00:49:50.219 --> 00:49:54.199 A:middle L:90%
to disk. It's all trade-offs. Okay

875
00:49:54.210 --> 00:49:58.409 A:middle L:90%
. So we then combine them in different ways and

876
00:49:58.420 --> 00:50:01.349 A:middle L:90%
ultimately, combining these particular ones,

877
00:50:01.349 --> 00:50:04.639 A:middle L:90%
we got a 4.2-fold speedup. When my

878
00:50:04.639 --> 00:50:07.409 A:middle L:90%
student did the first part, he was really excited

879
00:50:07.409 --> 00:50:08.059 A:middle L:90%
like, oh, now we can combine all these

880
00:50:08.059 --> 00:50:10.909 A:middle L:90%
and we'll get this huge multiplicative speedup. Yeah

881
00:50:10.909 --> 00:50:13.590 A:middle L:90%
, I wish, I wish it was that easy

882
00:50:13.599 --> 00:50:15.460 A:middle L:90%
. Um It turns out that some of these optimizations

883
00:50:15.469 --> 00:50:20.980 A:middle L:90%
conflict with one another, and so you have

884
00:50:20.980 --> 00:50:22.579 A:middle L:90%
to figure out which ones are

885
00:50:22.579 --> 00:50:27.650 A:middle L:90%
more amenable to combination. And so ultimately, here's

886
00:50:27.650 --> 00:50:30.889 A:middle L:90%
the summary slide for the architecture optimizations. So we

887
00:50:30.889 --> 00:50:35.260 A:middle L:90%
went with a basic implementation; that's the speedup

888
00:50:35.260 --> 00:50:36.719 A:middle L:90%
over hand-tuned SSE

889
00:50:36.719 --> 00:50:39.469 A:middle L:90%
code. Architecture-unaware means we just use the optimizations for

890
00:50:39.469 --> 00:50:43.429 A:middle L:90%
the other platform on the current platform. You swap

891
00:50:43.429 --> 00:50:46.039 A:middle L:90%
them, just to see how well they do

892
00:50:46.050 --> 00:50:50.329 A:middle L:90%
. And so some of the optimizations do transcend the

893
00:50:50.329 --> 00:50:52.840 A:middle L:90%
architecture, in that they apply to both architectures. But

894
00:50:52.840 --> 00:50:54.889 A:middle L:90%
then, when we actually tune it with respect to

895
00:50:54.889 --> 00:50:58.909 A:middle L:90%
the individual architecture, you remember that original

896
00:50:58.909 --> 00:51:00.969 A:middle L:90%
224? Now it's up to a 371-fold

897
00:51:00.969 --> 00:51:12.989 A:middle L:90%
speedup. Barbara? [Inaudible audience question.]

898
00:51:14.659 --> 00:51:17.730 A:middle L:90%
Excellent question. So the question was, with

899
00:51:17.739 --> 00:51:21.349 A:middle L:90%
GPUs changing so quickly, how

900
00:51:21.349 --> 00:51:22.480 A:middle L:90%
much do we get out of these optimizations? We

901
00:51:22.489 --> 00:51:24.820 A:middle L:90%
get it for about a year, and then

902
00:51:24.829 --> 00:51:27.199 A:middle L:90%
we have to go on to the next

903
00:51:27.199 --> 00:51:30.300 A:middle L:90%
ones. So the answer is, I

904
00:51:30.300 --> 00:51:37.809 A:middle L:90%
mean, like 18 to 24 months. However, because

905
00:51:37.809 --> 00:51:39.409 A:middle L:90%
of the way that the heterogeneous computing environments, particularly

906
00:51:39.409 --> 00:51:43.920 A:middle L:90%
GPUs, have been evolving, they've been converging more toward

907
00:51:43.920 --> 00:51:47.539 A:middle L:90%
similar architectures, much like the way

908
00:51:47.539 --> 00:51:52.369 A:middle L:90%
you see Intel and AMD CPUs are reasonably the

909
00:51:52.369 --> 00:51:54.190 A:middle L:90%
same. Um They may be implemented under the covers

910
00:51:54.190 --> 00:51:57.449 A:middle L:90%
a little bit differently, but they still have some

911
00:51:57.449 --> 00:52:00.949 A:middle L:90%
of the same artifacts. The notion, or the hope,

912
00:52:00.960 --> 00:52:06.269 A:middle L:90%
is that many of the architecture-independent optimizations that are

913
00:52:06.280 --> 00:52:09.829 A:middle L:90%
common across all GPUs will then apply, and that the library

914
00:52:09.829 --> 00:52:16.769 A:middle L:90%
or set of architecture-dependent optimizations will hopefully be smaller

915
00:52:17.849 --> 00:52:20.489 A:middle L:90%
. The trickier part is going to

916
00:52:20.500 --> 00:52:22.949 A:middle L:90%
be a lot of the performance modeling in terms of

917
00:52:22.949 --> 00:52:25.039 A:middle L:90%
movement of data, because right now a

918
00:52:25.039 --> 00:52:28.510 A:middle L:90%
discrete GPU, like on HokieSpeed, sits out on the

919
00:52:28.510 --> 00:52:30.530 A:middle L:90%
PCI Express bus. So you have this

920
00:52:30.539 --> 00:52:31.880 A:middle L:90%
overhead of moving data back and forth to compute.

921
00:52:32.750 --> 00:52:36.329 A:middle L:90%
And there are other architectures now that put the CPU

922
00:52:36.329 --> 00:52:38.019 A:middle L:90%
and GPU on the same die. It's like,

923
00:52:38.030 --> 00:52:40.340 A:middle L:90%
oh wait, then I won't have to move the data

924
00:52:40.349 --> 00:52:44.829 A:middle L:90%
. That's true. But you have fewer GPU cores,

925
00:52:44.840 --> 00:52:47.659 A:middle L:90%
and the memory that you're using is slower than on the

926
00:52:47.659 --> 00:52:50.820 A:middle L:90%
discrete one, that memory space. So you start to

927
00:52:50.829 --> 00:52:52.139 A:middle L:90%
think about, oh wait, you've got to model

928
00:52:52.139 --> 00:52:53.719 A:middle L:90%
all this so you can figure out: all right,

929
00:52:53.730 --> 00:52:58.539 A:middle L:90%
if I know what the application's signature is like

930
00:52:58.550 --> 00:53:00.760 A:middle L:90%
, I know that it's gonna bang on the CPU

931
00:53:00.769 --> 00:53:02.079 A:middle L:90%
a certain amount, or the compute part a certain

932
00:53:02.079 --> 00:53:05.269 A:middle L:90%
amount, the memory a certain amount, then I have a

933
00:53:05.269 --> 00:53:07.849 A:middle L:90%
better idea of the costs involved in moving data

934
00:53:07.860 --> 00:53:10.360 A:middle L:90%
and computing on data, in terms of trying to decide if I

935
00:53:10.360 --> 00:53:13.239 A:middle L:90%
want to use the GPU cores that are on the

936
00:53:13.239 --> 00:53:16.670 A:middle L:90%
die or that are remote on the discrete GPU.

937
00:53:19.849 --> 00:53:24.480 A:middle L:90%
[Partially inaudible audience question about power

938
00:53:27.949 --> 00:53:31.239 A:middle L:90%
constraints forcing the cores to run slower,

939
00:53:31.239 --> 00:53:38.269 A:middle L:90%
and having to select operating points accordingly.]

940
00:53:38.269 --> 00:53:45.510 A:middle L:90%
That's right. So

941
00:53:45.510 --> 00:53:50.579 A:middle L:90%
the comment and question by Professor Eric was that

942
00:53:50.590 --> 00:53:52.559 A:middle L:90%
there are forecasts that the doubling of the number of

943
00:53:52.559 --> 00:53:55.010 A:middle L:90%
cores is at some point going to reach an end.

944
00:53:55.019 --> 00:53:58.800 A:middle L:90%
Maybe it's this coming decade, maybe it's the next

945
00:53:58.809 --> 00:54:00.250 A:middle L:90%
. Not sure yet. And then the question ends

946
00:54:00.250 --> 00:54:01.619 A:middle L:90%
up being, well, you may end up

947
00:54:01.619 --> 00:54:05.349 A:middle L:90%
having to slow down the processor speeds

948
00:54:05.360 --> 00:54:07.320 A:middle L:90%
and try to save energy because you're reaching a power

949
00:54:07.320 --> 00:54:08.750 A:middle L:90%
limit. There may be other technologies in the area

950
00:54:08.750 --> 00:54:12.159 A:middle L:90%
of quantum computing that might come to bear. We

951
00:54:12.159 --> 00:54:15.239 A:middle L:90%
don't know. But part of what

952
00:54:15.239 --> 00:54:16.239 A:middle L:90%
we're trying to do, in fact, with respect

953
00:54:16.239 --> 00:54:21.780 A:middle L:90%
to, let me show this with respect to this

954
00:54:21.780 --> 00:54:23.079 A:middle L:90%
box here, which is in purple, the performance and power

955
00:54:23.079 --> 00:54:27.960 A:middle L:90%
models. What we're trying to do is model the processors

956
00:54:27.969 --> 00:54:30.170 A:middle L:90%
, the CPU and GPU, in a way that we

957
00:54:30.170 --> 00:54:34.190 A:middle L:90%
can manipulate the level of parallelism and the frequencies and

958
00:54:34.190 --> 00:54:36.800 A:middle L:90%
voltages with which they operate, so that we

959
00:54:36.800 --> 00:54:38.460 A:middle L:90%
can optimize for power as best as possible. That's

960
00:54:38.460 --> 00:54:40.449 A:middle L:90%
one step toward it. I'm not saying that we've

961
00:54:40.460 --> 00:54:43.300 A:middle L:90%
completely answered it, but it's a very good point.

962
00:54:43.840 --> 00:54:45.559 A:middle L:90%
Well, it's exciting times. In another 10 years

963
00:54:45.559 --> 00:54:49.380 A:middle L:90%
, we'll have to see what the

964
00:54:49.389 --> 00:54:52.780 A:middle L:90%
future may bear. So this is where this

965
00:54:52.780 --> 00:54:57.599 A:middle L:90%
particular paper is, uh, submitted, and future work is ideally

966
00:54:57.599 --> 00:54:59.909 A:middle L:90%
where we want to combine these aspects. Um, I've

967
00:54:59.909 --> 00:55:05.320 A:middle L:90%
been, uh, in discussions with a compiler colleague

968
00:55:05.329 --> 00:55:07.599 A:middle L:90%
who is really interested in the notion of instead of

969
00:55:07.599 --> 00:55:10.659 A:middle L:90%
just looking at blocks, uh, control blocks, to

970
00:55:10.659 --> 00:55:14.280 A:middle L:90%
optimize between control points. If we can then go

971
00:55:14.280 --> 00:55:16.239 A:middle L:90%
beyond the control blocks and start to look at these

972
00:55:16.239 --> 00:55:19.670 A:middle L:90%
notions of dwarfs and try to get a meta-level

973
00:55:19.670 --> 00:55:23.199 A:middle L:90%
understanding of how we can approximate better performance by

974
00:55:23.199 --> 00:55:28.420 A:middle L:90%
doing certain mapping of data and computation in places.

975
00:55:28.429 --> 00:55:31.130 A:middle L:90%
But that's a really, really difficult problem um that

976
00:55:31.139 --> 00:55:34.400 A:middle L:90%
I am ill-equipped to answer, but that's why I

977
00:55:34.400 --> 00:55:37.489 A:middle L:90%
have my colleagues in the compiler area to help me

978
00:55:37.489 --> 00:55:43.019 A:middle L:90%
out. Um, performance and power modeling. All right.

979
00:55:43.019 --> 00:55:44.780 A:middle L:90%
I'm going to breeze through these last couple because

980
00:55:44.780 --> 00:55:46.880 A:middle L:90%
of time. So this just gives you an idea

981
00:55:46.889 --> 00:55:50.570 A:middle L:90%
of some of the challenges we face. On the y

982
00:55:50.570 --> 00:55:55.559 A:middle L:90%
axis you have execution time and energy consumed,

983
00:55:55.570 --> 00:55:59.989 A:middle L:90%
and on the x axis, the first number is the

984
00:55:59.989 --> 00:56:01.820 A:middle L:90%
number of CPUs and the second number is the number

985
00:56:01.820 --> 00:56:06.099 A:middle L:90%
of threads per CPU. So the idea is, you think

986
00:56:06.099 --> 00:56:07.739 A:middle L:90%
, okay, we have parallelism. So if you

987
00:56:07.739 --> 00:56:09.980 A:middle L:90%
throw more processors and more threads per processor at the

988
00:56:09.980 --> 00:56:15.940 A:middle L:90%
problem, you should get better performance. That's not

989
00:56:15.940 --> 00:56:19.099 A:middle L:90%
the case here. In this case, when

990
00:56:19.099 --> 00:56:21.909 A:middle L:90%
I used all four cores and two threads per core

991
00:56:21.920 --> 00:56:24.130 A:middle L:90%
, I got the worst performance, and on top of that

992
00:56:24.260 --> 00:56:29.809 A:middle L:90%
, I consumed the most energy. So

993
00:56:29.989 --> 00:56:34.130 A:middle L:90%
there's a lot of space here to try and figure

994
00:56:34.130 --> 00:56:37.159 A:middle L:90%
out from a modeling perspective, how do you automate

995
00:56:37.159 --> 00:56:40.429 A:middle L:90%
this process so that you can optimize either performance, power,

996
00:56:40.539 --> 00:56:45.739 A:middle L:90%
or some hybrid of the two, a co-optimization

997
00:56:46.230 --> 00:56:47.550 A:middle L:90%
of these two things. All right. So

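The search just described, choosing a configuration to optimize performance, energy, or some weighted hybrid, can be sketched as a tiny brute-force pass over measured configurations. The time and energy values below are synthetic stand-ins for real measurements, chosen only to echo the observation above that the most parallel configuration can be worst on both axes.

```python
# Toy co-optimization over (processors, threads-per-processor) configs.
# The time/energy numbers are synthetic placeholders for measurements;
# the point is the search, not the data.

# (procs, threads) -> (execution time in s, energy in J), illustrative
MEASURED = {
    (1, 1): (10.0, 50.0),
    (2, 1): (6.0, 45.0),
    (4, 1): (4.0, 48.0),
    (4, 2): (7.0, 80.0),   # most parallelism, worst on both axes
}

def pick(weight_time=0.5):
    """Return the config minimizing a normalized weighted objective.

    weight_time=1.0 optimizes performance only, 0.0 optimizes energy
    only, and anything in between co-optimizes the two.
    """
    t_max = max(t for t, _ in MEASURED.values())
    e_max = max(e for _, e in MEASURED.values())

    def objective(cfg):
        t, e = MEASURED[cfg]
        return weight_time * t / t_max + (1 - weight_time) * e / e_max

    return min(MEASURED, key=objective)

if __name__ == "__main__":
    print(pick(1.0))   # fastest config
    print(pick(0.0))   # lowest-energy config
    print(pick(0.5))   # balanced hybrid
```

A real framework would replace the lookup table with a predictive model, which is exactly what the modeling work described here is after.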
998
00:56:49.130 --> 00:56:52.699 A:middle L:90%
we're looking at a framework that has higher-accuracy prediction

999
00:56:52.710 --> 00:56:55.769 A:middle L:90%
, identifies portable predictors that will carry across different architectures

1000
00:56:55.780 --> 00:57:00.030 A:middle L:90%
um, and then does a multi-dimensional characterization of it

1001
00:57:00.039 --> 00:57:02.579 A:middle L:90%
: uh, sequential code, intra-node parallel, inter-node

1002
00:57:02.579 --> 00:57:06.659 A:middle L:90%
parallel, as well as looking at different power aspects,

1003
00:57:06.659 --> 00:57:07.309 A:middle L:90%
which is the part that Google has been

1004
00:57:07.309 --> 00:57:10.300 A:middle L:90%
particularly interested in of late. We're able, through

1005
00:57:10.300 --> 00:57:14.980 A:middle L:90%
software, to monitor power consumption of subsystems within a computer without

1006
00:57:14.980 --> 00:57:16.860 A:middle L:90%
having to open it up and attach oscilloscopes

1007
00:57:16.860 --> 00:57:22.289 A:middle L:90%
and digital meters. Um, these are some algorithms

1008
00:57:22.300 --> 00:57:24.909 A:middle L:90%
. Uh so we actually cast this as a linear

1009
00:57:24.909 --> 00:57:28.849 A:middle L:90%
programming problem for the work that we've done, um,

1010
00:57:29.230 --> 00:57:30.440 A:middle L:90%
eliminating the notion of multi-core for the time being

1011
00:57:30.440 --> 00:57:34.489 A:middle L:90%
. We're just looking at multi-processor. Um, and

1012
00:57:34.500 --> 00:57:37.000 A:middle L:90%
I'm not going to get into the details. I'm

1013
00:57:37.000 --> 00:57:38.090 A:middle L:90%
just gonna say that we applied that and we got

1014
00:57:38.090 --> 00:57:42.400 A:middle L:90%
results such as the following where um we're running on

1015
00:57:42.400 --> 00:57:45.460 A:middle L:90%
a bunch of different parallel codes. And we set,

1016
00:57:45.469 --> 00:57:50.409 A:middle L:90%
we put in a bound of less than 5% performance

1017
00:57:50.420 --> 00:57:53.159 A:middle L:90%
impact. We violated it in this one case

1018
00:57:53.159 --> 00:57:55.269 A:middle L:90%
, in the conjugate gradient case, it was more like

1019
00:57:55.269 --> 00:57:59.309 A:middle L:90%
an 8% impact on performance. But what you

1020
00:57:59.309 --> 00:58:04.909 A:middle L:90%
see is, on average, we saved, uh, 19%

1021
00:58:05.170 --> 00:58:07.269 A:middle L:90%
of power consumption, I'm sorry, energy consumption.

1022
00:58:07.269 --> 00:58:09.710 A:middle L:90%
In one case we actually got energy improvement, which

1023
00:58:09.710 --> 00:58:13.349 A:middle L:90%
is yet something else that we haven't looked

1024
00:58:13.349 --> 00:58:16.719 A:middle L:90%
into in great detail. Um, so the last

1025
00:58:16.719 --> 00:58:20.809 A:middle L:90%
thing I'm going to wrap up with is this notion

1026
00:58:20.809 --> 00:58:23.750 A:middle L:90%
of the task scheduler, which is a heterogeneous task scheduling

1027
00:58:23.750 --> 00:58:27.280 A:middle L:90%
system. And so what we wanna do is automatically

1028
00:58:27.280 --> 00:58:31.869 A:middle L:90%
spread tasks across different heterogeneous computing units, from CPUs, GPUs,

1029
00:58:31.869 --> 00:58:37.510 A:middle L:90%
APUs. We want this automatic so that, uh, application scientists

1030
00:58:37.510 --> 00:58:39.590 A:middle L:90%
and engineers don't have to worry about the mundane details

1031
00:58:39.590 --> 00:58:43.400 A:middle L:90%
of doing this mapping, and so you want this runtime

1032
00:58:43.409 --> 00:58:45.929 A:middle L:90%
system to intelligently use what's available resource-wise and optimize

1033
00:58:45.929 --> 00:58:51.639 A:middle L:90%
for performance portability. Um we're taking a languages approach

1034
00:58:51.639 --> 00:58:54.000 A:middle L:90%
to this. Um right now, initial point is

1035
00:58:54.000 --> 00:58:58.489 A:middle L:90%
we're focusing on OpenMP. Um, it's a

1036
00:58:58.489 --> 00:59:00.909 A:middle L:90%
directive-based model. I'm not expecting you to

1037
00:59:00.920 --> 00:59:02.739 A:middle L:90%
understand all of this, but the point is,

1038
00:59:02.739 --> 00:59:07.429 A:middle L:90%
is that, um, what was done traditionally

1039
00:59:07.429 --> 00:59:10.139 A:middle L:90%
is in the upper left. What's been proposed by the

1040
00:59:10.320 --> 00:59:14.190 A:middle L:90%
committee is in the upper right. And what we're proposing

1041
00:59:14.190 --> 00:59:16.480 A:middle L:90%
to add is on the bottom, and the idea is

1042
00:59:16.480 --> 00:59:21.110 A:middle L:90%
that this just helps the compiler identify which regions

1043
00:59:21.110 --> 00:59:23.469 A:middle L:90%
are parallel, and this is one step towards doing this

1044
00:59:23.469 --> 00:59:28.019 A:middle L:90%
fully automatically instead of having, uh, user directives or compiler

1045
00:59:28.019 --> 00:59:32.619 A:middle L:90%
directives that are inserted. Um so this is the

1046
00:59:32.619 --> 00:59:36.920 A:middle L:90%
way programs typically run: you have an, uh,

1047
00:59:36.929 --> 00:59:38.489 A:middle L:90%
OpenMP parallel region, you run in parallel on the

1048
00:59:38.489 --> 00:59:42.889 A:middle L:90%
CPU, then you join and then you fork again

1049
00:59:42.889 --> 00:59:45.079 A:middle L:90%
and you say, oh, here's the accelerator region running

1050
00:59:45.079 --> 00:59:46.170 A:middle L:90%
on the GPU. So you're running either on the

1051
00:59:46.170 --> 00:59:50.989 A:middle L:90%
CPU or the GPU. But the semantics of the

1052
00:59:50.989 --> 00:59:52.070 A:middle L:90%
language are such that you can't do both at the

1053
00:59:52.070 --> 00:59:54.039 A:middle L:90%
same time, which is what we would like to

1054
00:59:54.039 --> 00:59:57.289 A:middle L:90%
do. We want to be able to

1055
00:59:57.289 --> 00:59:59.090 A:middle L:90%
run both on the CPUs and the GPUs at the

1056
00:59:59.090 --> 01:00:02.170 A:middle L:90%
same time. And so there's actually a linear

1057
01:00:02.179 --> 01:00:06.380 A:middle L:90%
optimization scheduling model behind this runtime scheduling system as

1058
01:00:06.380 --> 01:00:07.570 A:middle L:90%
well, which I'm not going to get into here

1059
01:00:07.579 --> 01:00:10.050 A:middle L:90%
. I'm just gonna show you some results. So

1060
01:00:10.059 --> 01:00:13.690 A:middle L:90%
, these are codes that have already been parallelized at

1061
01:00:13.690 --> 01:00:15.829 A:middle L:90%
design and compile time. All we're going to do

1062
01:00:15.829 --> 01:00:17.670 A:middle L:90%
is see if we can extract any additional performance out

1063
01:00:17.670 --> 01:00:22.500 A:middle L:90%
of it at runtime automatically. And so it's all

1064
01:00:22.510 --> 01:00:25.210 A:middle L:90%
, um, normalized to running on the CPU

1065
01:00:25.219 --> 01:00:29.480 A:middle L:90%
. So CPU speedup is one, and what you see is

1066
01:00:29.480 --> 01:00:31.960 A:middle L:90%
that on the GPU alone, uh, we do

1067
01:00:31.960 --> 01:00:35.440 A:middle L:90%
worse in two of the

1068
01:00:35.440 --> 01:00:37.739 A:middle L:90%
cases and we do better in two of the cases

1069
01:00:37.750 --> 01:00:38.409 A:middle L:90%
, GEMM and k-means. And that's because these

1070
01:00:38.409 --> 01:00:45.949 A:middle L:90%
are more amenable to parallelization on the GPU. The

1071
01:00:45.960 --> 01:00:47.429 A:middle L:90%
four other ones, which I won't get into the

1072
01:00:47.429 --> 01:00:51.920 A:middle L:90%
details of: all four of these are dynamically trying

1073
01:00:51.920 --> 01:00:55.119 A:middle L:90%
to decide, through either a static decision-making process or a

1074
01:00:55.119 --> 01:01:00.679 A:middle L:90%
dynamic one, and then some, uh, variations of it

1075
01:01:00.800 --> 01:01:01.690 A:middle L:90%
. They're trying to decide how much work to put

1076
01:01:01.690 --> 01:01:07.800 A:middle L:90%
on the CPUs and GPUs concurrently, simultaneously, running at

1077
01:01:07.800 --> 01:01:09.159 A:middle L:90%
the same time. So what you see here is

1078
01:01:09.159 --> 01:01:15.219 A:middle L:90%
that, for that parallel code where we got a 371-fold speedup

1079
01:01:15.219 --> 01:01:17.900 A:middle L:90%
. Now at runtime we get an additional eight-fold

1080
01:01:17.900 --> 01:01:22.590 A:middle L:90%
speedup automatically through our runtime system. On the

1081
01:01:22.590 --> 01:01:25.349 A:middle L:90%
other hand, you see here that the CPU was

1082
01:01:25.349 --> 01:01:29.820 A:middle L:90%
best for the code. All the other ones did

1083
01:01:29.820 --> 01:01:32.730 A:middle L:90%
worse, and that's it. But our auto-adapting one

1084
01:01:32.730 --> 01:01:35.559 A:middle L:90%
did pretty well. The reason why it didn't do

1085
01:01:35.559 --> 01:01:37.639 A:middle L:90%
quite as well is because there's time involved in solving

1086
01:01:37.639 --> 01:01:42.559 A:middle L:90%
the linear, uh, optimization problem to decide to move all

1087
01:01:42.559 --> 01:01:46.440 A:middle L:90%
of the work onto the CPU. All right.

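The co-scheduling idea described above, running one parallel region on the CPU and GPU at the same time instead of either/or, can be sketched in plain Python rather than accelerated OpenMP, with two threads standing in for the two devices. The per-device rates are assumptions; a real scheduler would measure or model them, as the runtime system here does.

```python
import threading

# Stand-ins for the two devices: each processes items at some rate.
# Rates are illustrative assumptions, not measurements.
CPU_RATE = 2.0   # items per ms
GPU_RATE = 6.0   # items per ms

def split(n_items, cpu_rate, gpu_rate):
    """Divide work so both devices finish at about the same time.

    Giving each device work proportional to its rate makes each
    device's time n_share / rate = n_items / (cpu_rate + gpu_rate).
    """
    cpu_share = round(n_items * cpu_rate / (cpu_rate + gpu_rate))
    return cpu_share, n_items - cpu_share

def run_region(n_items):
    """Run one 'parallel region' on both devices concurrently."""
    cpu_n, gpu_n = split(n_items, CPU_RATE, GPU_RATE)
    results = {}

    def work(name, count):
        results[name] = count  # placeholder for real kernel work

    threads = [threading.Thread(target=work, args=("cpu", cpu_n)),
               threading.Thread(target=work, args=("gpu", gpu_n))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    print(split(800, CPU_RATE, GPU_RATE))  # -> (200, 600)
    print(run_region(800))
```

The linear optimization model mentioned in the talk plays the role of `split` here, but also accounts for its own solve time, which is why the all-CPU case above paid a small penalty.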
1088
01:01:46.440 --> 01:01:50.170 A:middle L:90%
So that's a recent publication, uh, you can

1089
01:01:50.179 --> 01:01:52.329 A:middle L:90%
look it up: heterogeneous task scheduling for accelerated OpenMP

1090
01:01:52.340 --> 01:01:55.840 A:middle L:90%
. Um, we're actually generalizing it to OpenCL and

1091
01:01:55.849 --> 01:02:00.289 A:middle L:90%
in general parallel computing environments. So there's a ton

1092
01:02:00.289 --> 01:02:02.300 A:middle L:90%
of work that's still to be done. Um And

1093
01:02:02.300 --> 01:02:05.090 A:middle L:90%
this is just the part that we're doing with respect

1094
01:02:05.090 --> 01:02:07.280 A:middle L:90%
to heterogeneous computing. Uh Like I said, we

1095
01:02:07.280 --> 01:02:09.849 A:middle L:90%
have two projects in cloud computing, um, we

1096
01:02:09.849 --> 01:02:12.440 A:middle L:90%
have a lot of things going on in the green

1097
01:02:12.440 --> 01:02:15.170 A:middle L:90%
computing space as well. But this is where we

1098
01:02:15.170 --> 01:02:19.190 A:middle L:90%
are at, and where, uh, we're looking to go next

1099
01:02:19.199 --> 01:02:23.369 A:middle L:90%
. And uh that's a sampling of publications, which

1100
01:02:23.369 --> 01:02:25.789 A:middle L:90%
I'm going to show. You're not gonna be able to

1101
01:02:25.800 --> 01:02:29.429 A:middle L:90%
grok all this, but just go to synergy.cs.vt.edu.

1102
01:02:29.429 --> 01:02:30.309 A:middle L:90%
And

1103
01:02:30.320 --> 01:02:34.510 A:middle L:90%
you'll see a number of these publications. Um, I would

1104
01:02:34.510 --> 01:02:37.289 A:middle L:90%
be remiss if I didn't acknowledge a

1105
01:02:37.300 --> 01:02:42.019 A:middle L:90%
number of the funding agencies that have generously

1106
01:02:42.019 --> 01:02:45.090 A:middle L:90%
supported this work. Um, especially some of my really

1107
01:02:45.099 --> 01:02:51.409 A:middle L:90%
aggressive and maybe, uh, idealistic views of where we'd like

1108
01:02:51.409 --> 01:02:54.460 A:middle L:90%
to be. But these funds have helped

1109
01:02:54.469 --> 01:02:58.769 A:middle L:90%
support us in trying some things that we might not otherwise

1110
01:02:58.769 --> 01:03:01.960 A:middle L:90%
have tried. And if you want more information

1111
01:03:01.960 --> 01:03:06.250 A:middle L:90%
, you can go to any of those websites and

1112
01:03:06.250 --> 01:03:09.510 A:middle L:90%
my contact information is above. Okay. Mhm.

1113
01:03:15.199 --> 01:03:16.340 A:middle L:90%
We have time for one more question, or should

1114
01:03:16.340 --> 01:03:55.230 A:middle L:90%
we let him go? Yeah, go ahead my

1115
01:03:58.800 --> 01:04:03.289 A:middle L:90%
Yes, yes, yes, that's an excellent question

1116
01:04:03.289 --> 01:04:05.730 A:middle L:90%
. So I'll just summarize it. The question was

1117
01:04:05.730 --> 01:04:10.159 A:middle L:90%
really about, uh, these optimizations being well and

1118
01:04:10.159 --> 01:04:13.400 A:middle L:90%
good, uh, but they're system-level, more or less,

1119
01:04:13.409 --> 01:04:16.360 A:middle L:90%
and there are things like data dependencies that occur that

1120
01:04:16.369 --> 01:04:20.030 A:middle L:90%
will uh not necessarily allow those optimizations to apply.

1121
01:04:20.039 --> 01:04:24.320 A:middle L:90%
And so that is definitely the case. But what

1122
01:04:24.320 --> 01:04:28.530 A:middle L:90%
I would say is that um the data dependencies,

1123
01:04:28.539 --> 01:04:30.380 A:middle L:90%
we're trying to address them at two levels. One

1124
01:04:30.380 --> 01:04:32.840 A:middle L:90%
is back here, at the architecture, where the optimizations, those

1125
01:04:32.840 --> 01:04:36.329 A:middle L:90%
are system-level. It's not really paying attention to

1126
01:04:36.340 --> 01:04:41.369 A:middle L:90%
uh, runtime data dependencies, because it can't; it's

1127
01:04:41.369 --> 01:04:43.820 A:middle L:90%
doing it at compile time. But what we're hoping to do

1128
01:04:43.829 --> 01:04:45.769 A:middle L:90%
is try to get some notion of what the computational

1129
01:04:45.769 --> 01:04:48.500 A:middle L:90%
and communication pattern is. So at a higher level

1130
01:04:48.500 --> 01:04:51.630 A:middle L:90%
we may be able to apply some crude optimizations that

1131
01:04:51.630 --> 01:04:54.360 A:middle L:90%
would say, well, you know what based on

1132
01:04:54.360 --> 01:04:58.469 A:middle L:90%
this type of application signature, it doesn't pay for

1133
01:04:58.469 --> 01:05:02.070 A:middle L:90%
you to move the computation to the discrete GPU; use

1134
01:05:02.090 --> 01:05:08.000 A:middle L:90%
the on-die GPU, because the cost of moving that data

1135
01:05:08.090 --> 01:05:11.710 A:middle L:90%
for this particular application signature is too high. So

1136
01:05:11.710 --> 01:05:13.739 A:middle L:90%
that kind of optimization, we would hope that it

1137
01:05:13.739 --> 01:05:15.489 A:middle L:90%
could be done at compile time. I think that

1138
01:05:15.489 --> 01:05:16.070 A:middle L:90%
one is going to be the harder, much harder

1139
01:05:16.070 --> 01:05:18.159 A:middle L:90%
one to do. Definitely harder for me because I'm

1140
01:05:18.159 --> 01:05:21.090 A:middle L:90%
not a compiler person. Um what we're doing here

1141
01:05:21.090 --> 01:05:25.849 A:middle L:90%
though is in the runtime system based on those data

1142
01:05:25.849 --> 01:05:29.039 A:middle L:90%
dependencies, we're figuring out where the time is being

1143
01:05:29.039 --> 01:05:32.289 A:middle L:90%
spent, and we're automatically reallocating where the tasks are on

1144
01:05:32.289 --> 01:05:36.780 A:middle L:90%
CPUs and GPUs at runtime, based on the

1145
01:05:36.780 --> 01:05:42.059 A:middle L:90%
execution control flow, through live execution of a

1146
01:05:42.070 --> 01:05:53.090 A:middle L:90%
program. Yeah, we're doing Yeah, yeah.

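The placement rule just described, stay on the on-die GPU when the cost of moving data to the discrete GPU outweighs its compute advantage, can be sketched as a simple cost comparison. The bandwidth and timing numbers below are illustrative assumptions, not measurements from the talk.

```python
# Toy placement decision: use the discrete GPU only if its compute
# advantage outweighs the cost of moving the data across the bus.
# All constants are illustrative assumptions.

PCIE_GBPS = 8.0  # host <-> discrete GPU transfer bandwidth, assumed

def place(data_gb, discrete_time_s, integrated_time_s):
    """Return 'discrete' or 'integrated' based on total estimated time."""
    transfer = 2 * data_gb / PCIE_GBPS      # copy in and copy back
    discrete_total = transfer + discrete_time_s
    return "discrete" if discrete_total < integrated_time_s else "integrated"

if __name__ == "__main__":
    # Compute-heavy kernel on little data: moving it pays off.
    print(place(data_gb=0.5, discrete_time_s=1.0, integrated_time_s=4.0))
    # Data-heavy kernel: the transfer dominates, so stay on-die.
    print(place(data_gb=16.0, discrete_time_s=1.0, integrated_time_s=4.0))
```

The "application signature" idea in the talk generalizes this: instead of per-kernel timings, the decision is driven by a compile-time characterization of the computation and communication pattern.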
1147
01:05:53.590 --> 01:05:55.719 A:middle L:90%
So we could do it. These are excellent questions

1148
01:05:55.730 --> 01:05:58.489 A:middle L:90%
. The question was whether we're trying to manually do this,

1149
01:05:58.489 --> 01:06:00.269 A:middle L:90%
and in some sense, initially we did this manually

1150
01:06:00.269 --> 01:06:03.369 A:middle L:90%
because we wanted to understand the effectiveness or the efficacy

1151
01:06:03.369 --> 01:06:06.269 A:middle L:90%
of potentially making use of all these devices and if

1152
01:06:06.269 --> 01:06:10.510 A:middle L:90%
we were omnipotent, how we would go about orchestrating

1153
01:06:10.519 --> 01:06:14.989 A:middle L:90%
who gets what. Now we've gotten to the point that

1154
01:06:15.000 --> 01:06:18.059 A:middle L:90%
we learn this through the runtime system. We see

1155
01:06:18.059 --> 01:06:23.239 A:middle L:90%
how it's executed in the past, and we learn and try

1156
01:06:23.239 --> 01:06:27.380 A:middle L:90%
to predict how to divvy up the work on the

1157
01:06:27.380 --> 01:06:29.590 A:middle L:90%
different CPUs and GPUs and how much of it to

1158
01:06:29.590 --> 01:06:31.230 A:middle L:90%
divvy up on the CPUs and GPUs. Okay,

1159
01:06:31.230 --> 01:06:33.059 A:middle L:90%
so that's all, that's that. That part is

1160
01:06:33.059 --> 01:06:40.579 A:middle L:90%
automatic right now. Okay. Yeah. Yeah.

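One way to sketch "learn from past executions and re-divvy the work" is an exponentially weighted estimate of each device's observed throughput, with the split re-derived from the estimates. This is a toy illustration under assumed numbers, not the actual runtime system.

```python
# Toy adaptive splitter: keep a smoothed estimate of each device's
# observed throughput from past runs and split work proportionally.
# Initial rates and the smoothing factor are illustrative assumptions.

ALPHA = 0.5  # smoothing factor for the throughput estimates

class AdaptiveSplitter:
    def __init__(self, cpu_rate=1.0, gpu_rate=1.0):
        self.cpu_rate = cpu_rate   # initial guesses, items per second
        self.gpu_rate = gpu_rate

    def split(self, n_items):
        """Work shares proportional to current throughput estimates."""
        total = self.cpu_rate + self.gpu_rate
        cpu_n = round(n_items * self.cpu_rate / total)
        return cpu_n, n_items - cpu_n

    def observe(self, cpu_items, cpu_secs, gpu_items, gpu_secs):
        """Fold one execution's measured rates into the estimates."""
        self.cpu_rate = (1 - ALPHA) * self.cpu_rate + ALPHA * (cpu_items / cpu_secs)
        self.gpu_rate = (1 - ALPHA) * self.gpu_rate + ALPHA * (gpu_items / gpu_secs)

if __name__ == "__main__":
    s = AdaptiveSplitter()
    print(s.split(100))   # no history yet: even split
    s.observe(cpu_items=50, cpu_secs=5.0, gpu_items=50, gpu_secs=1.0)
    print(s.split(100))   # the GPU proved faster, so it gets more work
```

The smoothing keeps one noisy run from swinging the split too hard, which matches the "learn from how it executed in the past and predict" behavior described above.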
1161
01:06:41.079 --> 01:06:54.909 A:middle L:90%
You want right. Oh I right. Uh huh

1162
01:06:55.380 --> 01:07:02.900 A:middle L:90%
. Mhm. Um, so the question is

1163
01:07:02.900 --> 01:07:06.429 A:middle L:90%
, how would companies be leveraging cell phones and mobile

1164
01:07:06.429 --> 01:07:10.219 A:middle L:90%
platforms and what have you? I mean It's not

1165
01:07:10.219 --> 01:07:12.750 A:middle L:90%
necessarily the case that those 97% of the companies are

1166
01:07:12.760 --> 01:07:15.989 A:middle L:90%
making use specifically of mobile phones and desktop systems

1167
01:07:15.000 --> 01:07:17.380 A:middle L:90%
. Many of them are just making use of servers

1168
01:07:17.389 --> 01:07:19.619 A:middle L:90%
or the cloud or what have you. Um but

1169
01:07:19.619 --> 01:07:21.360 A:middle L:90%
in the case, if you're trying to think about

1170
01:07:21.360 --> 01:07:24.030 A:middle L:90%
ways that companies could make use of it. I

1171
01:07:24.030 --> 01:07:27.800 A:middle L:90%
mean you could think of uh, if you had

1172
01:07:27.800 --> 01:07:30.960 A:middle L:90%
some special iPhone app or Android app that does

1173
01:07:30.960 --> 01:07:34.019 A:middle L:90%
real-time financial data analytics that allows you to figure out

1174
01:07:34.019 --> 01:07:35.469 A:middle L:90%
whether or not you want to trade a stock or

1175
01:07:35.469 --> 01:07:41.409 A:middle L:90%
not. That's one example. Or, um, real-time

1176
01:07:41.420 --> 01:07:43.710 A:middle L:90%
weather forecasting: right now you've got to go to the weather

1177
01:07:43.849 --> 01:07:47.269 A:middle L:90%
service and get a really coarse-grained, slow forecast, but maybe

1178
01:07:47.280 --> 01:07:49.710 A:middle L:90%
sometime in the future you'll be able to use the

1179
01:07:49.710 --> 01:07:53.389 A:middle L:90%
capability that you have on your phone, link up with

1180
01:07:53.389 --> 01:07:57.309 A:middle L:90%
all of these heterogeneous resources, to be able to

1181
01:07:57.320 --> 01:08:00.730 A:middle L:90%
do your own localized weather forecasting within the New River

1182
01:08:00.730 --> 01:08:02.449 A:middle L:90%
Valley area, for example. A little bit pie

1183
01:08:02.449 --> 01:08:03.980 A:middle L:90%
in the sky. But let me just give you

1184
01:08:03.980 --> 01:08:08.710 A:middle L:90%
an idea. All right, well, uh,

1185
01:08:08.719 --> 01:08:10.650 A:middle L:90%
thanks. I'll stick around for a little bit.

1186
01:08:10.650 --> 01:08:12.849 A:middle L:90%
I, uh, I've got to run. Actually

1187
01:08:12.860 --> 01:08:14.650 A:middle L:90%
, I'm going to go to another meeting in two

1188
01:08:14.650 --> 01:08:16.270 A:middle L:90%
minutes, technically, but I'll stick around for five

1189
01:08:16.270 --> 01:08:18.350 A:middle L:90%
minutes or so if you have any questions. All

1190
01:08:18.350 -->  A:middle L:90%
right.

