Ex parte Munshi, No. 14/138,967 (P.T.A.B. Apr. 30, 2019)

UNITED STATES PATENT AND TRADEMARK OFFICE

APPLICATION NO.: 14/138,967    FILING DATE: 12/23/2013
61947 7590 05/02/2019
Apple - Blank Rome
c/o Blank Rome LLP
717 Texas Avenue, Suite 1400
HOUSTON, TX 77002

FIRST NAMED INVENTOR: Aaftab A. Munshi

UNITED STATES DEPARTMENT OF COMMERCE
United States Patent and Trademark Office
Address: COMMISSIONER FOR PATENTS, P.O. Box 1450, Alexandria, Virginia 22313-1450, www.uspto.gov

ATTORNEY DOCKET NO.: P17235US1 (119-0498US1)    CONFIRMATION NO.: 4074
EXAMINER: LIU, GORDON G    ART UNIT: 2612
NOTIFICATION DATE: 05/02/2019    DELIVERY MODE: ELECTRONIC

Please find below and/or attached an Office communication concerning this application or proceeding. The time period for reply, if any, is set in the attached communication. Notice of the Office communication was sent electronically on the above-indicated "Notification Date" to the following e-mail address(es): mbrininger@blankrome.com, houstonpatents@blankrome.com

PTOL-90A (Rev. 04/07)

UNITED STATES PATENT AND TRADEMARK OFFICE
BEFORE THE PATENT TRIAL AND APPEAL BOARD

Ex parte AAFTAB A. MUNSHI

Appeal 2018-002695
Application 14/138,967
Technology Center 2600

Before JOHN A. JEFFERY, DENISE M. POTHIER, and MATTHEW J. McNEILL, Administrative Patent Judges.

POTHIER, Administrative Patent Judge.

DECISION ON APPEAL

STATEMENT OF THE CASE

Appellant1,2 appeals under 35 U.S.C. § 134(a) from the Examiner's final rejection of claims 1, 3-10, 12-14, and 17-24 under 35 U.S.C. § 103 as unpatentable over Woolley (Cliff Woolley, INTRODUCTION TO OPENCL 1-57 (2010)), Rogers (U.S. Publication No. 2014/0149710 A1, published May 29, 2014), CUDA (NVIDIA, DYNAMIC PARALLELISM IN CUDA 1-2 (2012)), and Claessen (Koen Claessen et al., Expressive Array Constructs in an

1 Throughout this Opinion, we refer to the Final Office Action (Final Act.) mailed March 3, 2017, the Appeal Brief (Appeal Br.)
filed August 15, 2017, the Examiner's Answer (Ans.) mailed November 17, 2017, and the Reply Brief (Reply Br.) filed January 17, 2018.

2 Appellant identifies the real party in interest as Apple, Inc. Appeal Br. 3.

Embedded GPU Kernel Programming Language, Proc. of 7th Workshop Declarative Aspects & Apps. Multicore Prog. DAMP 21-30 (2012)). Final Act. 4-17. Claims 2, 11, 15, and 16 have been canceled. Appeal Br. 23-25 (Claims App'x). We reverse.

Invention

Graphics processor units or "GPUs continue to evolve into high performance parallel compute devices" (Spec. ¶ 2). But, a "GPU cannot create work for itself." Id. Standards have evolved for programming high-performance compute systems incorporating CPUs and GPUs. One such standard is the OpenCL (Open Computing Language) industry standard ... for general-purpose parallel programming of heterogeneous systems, which ... provide[s] an execution model ... to write code with enhanced performance and functionality. Id. ¶ 3.

Appellant's invention "relates to an execution model for allowing a graphics processor unit ... to enqueue kernels for execution on the GPU" (Spec. ¶ 1) and "without needing a host processor to queue the work for the GPU" (Spec. ¶ 4). Independent claim 1 exemplifies the claims at issue and reads as follows:

1.
A program storage device, on which are stored instructions, comprising instructions that when executed cause one or more compute units to:

enqueue a first kernel by a first compute unit for execution on a second compute unit, wherein the second compute unit is a graphical processing unit, and the first kernel operates over a first range;

determine, based on the execution of the first kernel, that a new range is required; and

enqueue a second kernel by execution of the first kernel on the second compute unit, the second kernel for execution on the second compute unit, wherein the second kernel operates over a second range.

Appeal Br. 23 (Claims App'x).

THE CONTENTIONS

Although the Examiner relied on Rogers to teach the second compute unit is a GPU as claim 1 recites, the Examiner later indicated that Rogers is duplicative. Compare Final Act. 6 (citing Rogers ¶ 18, Fig. 1) with Ans. 5 (stating Woolley teaches "the element of [a] GPU."). We therefore confine our discussion to Woolley, CUDA, and Claessen.

The Examiner found Woolley does not teach some of claim 1's limitations, including among others, "determin[ing], based on the execution of the first kernel, that a new range is required" and "enqueu[ing] a second kernel by execution of the first kernel on the second compute unit, ... wherein the second kernel operates over a second range." Final Act. 4-5 (citing Woolley 5, 8-10, 15). The Examiner relied on (a) CUDA to teach "enqueu[ing] a second kernel by execution of the first kernel on the second compute unit, the second kernel for execution on the second compute unit" and "determin[ing], based on the execution of the first kernel" (id. at 5 (citing CUDA 1)), and (b) Claessen to teach the first kernel operates over a first range, a new range is required, and the second kernel operates over a second range (id. at 6 (citing Claessen 22)). In the Answer, the Examiner further explained his position. Ans.
3-5 (discussing Woolley, CUDA, and Claessen).

Appellant contends Woolley and Claessen do not teach the "range" features in claim 1, including: "enqueu[ing] a second kernel by execution of the first kernel on the second compute unit, ... the second kernel operates over a second range" and "determin[ing], based on the execution of the first kernel, that a new range is required." Appeal Br. 9, 11-12. As explained below, we agree.

ANALYSIS

In the rejection, the Examiner mapped the recited "second kernel" to Woolley's parallel code. Final Act. 5 (citing Woolley 10). In the Answer, the Examiner further "map[ped] the serial codes including partial parallel codes into the first kernel." Ans. 4 (emphasis added). Given the Examiner took the position that partial parallel codes are part of the "first kernel," what in Woolley is mapped to the recited "second kernel" in claim 1 is not immediately apparent. Based on the qualifier "partial" in front of parallel codes, we presume that the Examiner continued to map other partial parallel codes or some part of the parallel codes in Woolley to the recited "second kernel."

We find this mapping problematic. First, Woolley describes that serial code executes in a Host thread and parallel code executes in many Device threads. Woolley 10. Woolley does not discuss the Host executing serial code and parallel codes, even in part. See id.; see Reply Br. 5. Second, Woolley does not discuss a second kernel (e.g., part of the parallel codes) being enqueued by execution of the first kernel (e.g., serial codes and partial parallel codes) as claim 1 recites.3 Third, Woolley may suggest the OpenCL Application uses the Host to enqueue kernel execution instances to the Devices. See Woolley 10, 15. But, Woolley does not discuss that part of the parallel code threads (the mapped "second kernel") is the same as the "kernel execution instances" enqueued to the Devices. Id. at 15.
Despite the Examiner's mapping (Ans. 4), Woolley teaches enqueuing a first (or second) kernel (e.g., kernel execution instances) by a first compute unit (e.g., a CPU) for execution on a second compute unit (e.g., a GPU) as claim 1 requires. See Woolley 8, 15, cited in Final Act. 4-5. Specifically, Woolley teaches an application runs on a Host (e.g., a CPU), which submits work to a Device (e.g., a GPU) by queuing kernel execution instances (e.g., a kernel) to the Device (e.g., enqueuing a kernel for execution on a second compute unit, such as a CUDA-Enabled GPU). Woolley 8, 15. But, Woolley does not teach explicitly a second kernel is enqueued on the second compute unit by executing a first kernel as claim 1 requires. Id. As such, Woolley does not teach the disputed "enqueu[ing] a second kernel by execution of the first kernel on the second compute unit" in claim 1. Appeal Br. 23 (Claims App'x).

As to this latter point, the rejection additionally relies on CUDA to teach "enqueu[ing]" one kernel by "execution of" another kernel. See Final Act. 5 (citing CUDA 1). CUDA discusses how a CUDA kernel creates nested work by using a CUDA runtime API (application programming interface) to launch other kernels. CUDA 1. Because Woolley teaches a Device is a "CUDA-Enabled GPU" (Woolley 8), combining CUDA with Woolley suggests work or a kernel execution instance, submitted to the Devices in Woolley (Woolley 15), can also launch a second kernel as taught by CUDA (CUDA 1). CUDA also discusses the runtime API runs on a host or device and a kernel can call GPU libraries without needing the CPU. CUDA 2.

3 As noted, Woolley teaches the Host, not the Devices, executes serial code. Woolley 10. This passage does not discuss enqueuing a first kernel, which includes the serial code, "for execution on a second compute unit" (e.g., one of the Devices) as claim 1 further recites.
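The nested launch that CUDA describes (a running kernel creating work for the same device without returning to the CPU) can be sketched abstractly. The following Python simulation is purely illustrative; the queue, kernel names, and range-halving rule are hypothetical and stand in for actual GPU behavior, not for NVIDIA's or Khronos's APIs:

```python
from collections import deque

# Hypothetical "device" work queue holding (kernel, work_range) pairs.
device_queue = deque()

def enqueue(kernel, work_range):
    device_queue.append((kernel, work_range))

def second_kernel(work_range):
    # Arbitrary per-item work over the second range.
    return [i * i for i in work_range]

def first_kernel(work_range):
    # Suppose execution reveals that only half the items need further
    # work: the first kernel determines a new range is required and
    # enqueues a second kernel itself, without involving the "host."
    new_range = range(len(work_range) // 2)
    enqueue(second_kernel, new_range)
    return list(work_range)

def run_device():
    # Drain the queue, executing each enqueued kernel over its range.
    results = []
    while device_queue:
        kernel, work_range = device_queue.popleft()
        results.append(kernel(work_range))
    return results

enqueue(first_kernel, range(8))  # the host enqueues the first kernel
results = run_device()
```

On real hardware the rough analogue is CUDA dynamic parallelism (a `__global__` kernel launching a child kernel) or OpenCL 2.0's device-side `enqueue_kernel`; this sketch only models the control flow the decision discusses.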
But, CUDA does not discuss explicitly queuing the kernel to be launched. See id. Nonetheless, presuming the Woolley and CUDA combination suggests the nested, second kernel is enqueued by execution of the first kernel on the same Device (e.g., the second compute unit as recited), these teachings do not further teach or suggest "the second kernel operates over a second range" and "determin[ing], based on the execution of the first kernel, that a new range is required"4 as claim 1 recites. See Appeal Br. 5, 12; see Reply Br. 5-6.

The Examiner further relied on Claessen to teach the above disputed limitations. Final Act. 6-7 (citing Claessen 21-22). Claessen discusses a GPU programming language involving a sync operation that enables the writing or generating of parallel reduction kernels. Claessen 21-22. Claessen explains the reduction kernel inputs an array having a length that is a power of two and outputs an array of length one by splitting the input array into two and repeating this process recursively until the array's length is one. Id. at 22. The reduction kernel results in unrolled kernels. Id. Yet, this discussion in Claessen relates to generating the reduction kernel (id.; see id. at 21), not executing or operating the kernel, as claim 1 recites and Appellant indicates. Appeal Br. 11 (stating "Claessen regards making kernels, not executing them."). The Examiner further states Claessen knows its array size at runtime, has subsequent kernel calls, and the range for the subsequent kernel calls is determined by the previous recursive function call. Ans. 5. Other than stating the "kernel input size must be known at compile time" (Claessen 22), Claessen does not address runtime or function calls (see id. at 21-22).

4 Notably, the recited "second range" is not required to be the determined "new range" in claim 1. Appeal Br. 23 (Claims App'x).
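The reduction pattern attributed to Claessen (repeatedly splitting a power-of-two-length array in half until one element remains) can be sketched in Python. This is an illustrative reconstruction of the pattern only, not Claessen's Obsidian code; note also that this sketch performs the halving at runtime, whereas the decision emphasizes that Claessen generates (unrolls) such kernels with the input size fixed at compile time:

```python
def reduce_step(xs, op):
    # One "kernel": split the array in half and combine the two halves
    # pairwise, producing an array over half the previous range.
    half = len(xs) // 2
    return [op(xs[i], xs[i + half]) for i in range(half)]

def reduce_all(xs, op):
    # Repeat until the array's length is one; each step's range
    # (the array length) is half the previous step's range.
    assert xs and len(xs) & (len(xs) - 1) == 0, "length must be a power of two"
    while len(xs) > 1:
        xs = reduce_step(xs, op)
    return xs[0]

total = reduce_all([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b)
```

For an eight-element input the ranges shrink 8 → 4 → 2 → 1, which is the halving the Board discusses when it assumes an array's length could equate to a "range."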
Thus, even presuming Claessen teaches an array's length equates to a "range" in light of the disclosure (see Spec. ¶ 49) and determines a new range by splitting the array's length in two, the Examiner does not explain sufficiently how Claessen suggests operating a kernel over a certain range during execution or determining a new range is required based on execution of a kernel as claim 1 recites. As such, when combining Claessen with Woolley and CUDA, the combination does not suggest both "determin[ing], based on the execution of the first kernel, that a new range is required" and "the second kernel operat[ing] over a second range" as claim 1 requires. Appeal Br. 23 (Claims App'x) (emphasis added).

Lastly, to the extent the rejection relies on CUDA to teach the "determin[ing], based on the execution of the first kernel, that a new range is required" step (Final Act. 5), we agree with Appellant that the Examiner has not considered this limitation in its entirety, but rather parsed a portion of the claim's limitations. Appeal Br. 10. Indeed, the claim's context as a whole provides the proper framework for claim construction. See Abbott Labs. v. Syntron Bioresearch, Inc., 334 F.3d 1343, 1351 (Fed. Cir. 2003) (stating "[t]he usage of the disputed claim terms in the context of the claims as a whole also informs the proper construction of the terms.").

For the foregoing reasons, Appellant has persuaded us of error in the rejection of (1) independent claim 1, (2) independent claims 12 and 18, which recite commensurate limitations, and (3) dependent claims 3-10, 13, 14, 17, and 19-24 for similar reasons.

DECISION

We reverse the Examiner's rejection of claims 1, 3-10, 12-14, and 17-24 under 35 U.S.C. § 103.

REVERSED