Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Counting distinct elements ... Vasiliki Kalavri | Boston University 2020 ... How can we count the number of distinct elements seen so far in a stream? Example use-case: Distinct users visiting one or multiple webpages ... hash table
0 credits | 69 pages | 630.01 KB | 1 year ago
Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Sampling streams ... Vasiliki Kalavri | Boston University 2020 ... A sample is a set of data elements selected via some random process ... Samples: the most fundamental synopses ... Instead of a fixed proportion, assume we can only store a sample S of fixed size, e.g. s elements. How can we continuously maintain a representative fixed-size sample of the stream so far
0 credits | 74 pages | 1.06 MB | 1 year ago
pandas: powerful Python data analysis toolkit - 0.24.0
implementation from pandas that depends on operators that are already defined on the underlying elements (scalars) of the ExtensionArray. See the ExtensionArray Operator Support documentation section ... Using DataFrame.itertuples() now creates iterators without internally allocating lists of all elements (GH20783) • Improved performance of Period constructor, additionally benefitting PeriodArray and ... • Bug in Series.hasnans() that could be incorrectly cached and return incorrect answers if null elements are introduced after an initial call (GH19700) • Series.isin() now treats all NaN-floats as equal
0 credits | 2973 pages | 9.90 MB | 1 year ago
pandas: powerful Python data analysis toolkit - 0.14.0
cases better (GH6531): – df.iloc[:-len(df)] is now empty – df.iloc[len(df)::-1] now enumerates all elements in reverse ... prior to 0.14. (GH6760) • Added nunique and value_counts functions to Index for counting unique elements. (GH6734) • stack and unstack now raise a ValueError when the level keyword refers to a non-unique ... It contains both list and iterator versions of range, filter, map and zip, plus other necessary elements for Python 3 compatibility. lmap, lzip, lrange and lfilter all produce lists instead of iterators
0 credits | 1349 pages | 7.67 MB | 1 year ago
Scalable Stream Processing - Spark Streaming and Flink
single-element RDDs by counting the number of elements in each RDD of the source DStream. ▶ union • Returns a new DStream that contains the union of the elements in two DStreams. ... Transformations (4/4) ▶ reduce • Returns a new DStream of single-element RDDs by aggregating the elements in each RDD using a given function. ▶ reduceByKey • Returns a new DStream of (K, V) pairs where the values
0 credits | 113 pages | 1.22 MB | 1 year ago
pandas: powerful Python data analysis toolkit - 0.25.0
uniquely valued Index objects when called with cache=True, with arg including at least two different elements from the set {None, numpy.nan, pandas.NaT} (GH22305) • Bug in DataFrame and Series where timezone ... in Project Governance documents. The documents clarify how decisions are made and how the various elements of our community interact, including the relationship between open source collaborative development ... sample of a Series or DataFrame object, use the head() and tail() methods. The default number of elements to display is five, but you may pass a custom number. In [4]: long_series = pd.Series(np.random
0 credits | 2827 pages | 9.62 MB | 1 year ago
pandas: powerful Python data analysis toolkit - 0.25.1
uniquely valued Index objects when called with cache=True, with arg including at least two different elements from the set {None, numpy.nan, pandas.NaT} (GH22305) • Bug in DataFrame and Series where timezone ... in Project Governance documents. The documents clarify how decisions are made and how the various elements of our community interact, including the relationship between open source collaborative development ... sample of a Series or DataFrame object, use the head() and tail() methods. The default number of elements to display is five, but you may pass a custom number. In [4]: long_series = pd.Series(np.random
0 credits | 2833 pages | 9.65 MB | 1 year ago
pandas: powerful Python data analysis toolkit - 1.3.2
in Project Governance documents. The documents clarify how decisions are made and how the various elements of our community interact, including the relationship between open source collaborative development ... values can be assigned to the selected data. For example, to assign the name anonymous to the first 3 elements of the third column: In [26]: titanic.iloc[0:3, 3] = "anonymous" In [27]: titanic.head() Out[27]: ... These methods have in general matching names with the equivalent built-in string methods for single elements, but are applied element-wise (remember element-wise calculations?) on each of the values of the
0 credits | 3509 pages | 14.01 MB | 1 year ago
pandas: powerful Python data analysis toolkit - 1.3.3
in Project Governance documents. The documents clarify how decisions are made and how the various elements of our community interact, including the relationship between open source collaborative development ... values can be assigned to the selected data. For example, to assign the name anonymous to the first 3 elements of the third column: In [26]: titanic.iloc[0:3, 3] = "anonymous" In [27]: titanic.head() Out[27]: ... These methods have in general matching names with the equivalent built-in string methods for single elements, but are applied element-wise (remember element-wise calculations?) on each of the values of the
0 credits | 3603 pages | 14.65 MB | 1 year ago
pandas: powerful Python data analysis toolkit - 1.3.4
in Project Governance documents. The documents clarify how decisions are made and how the various elements of our community interact, including the relationship between open source collaborative development ... values can be assigned to the selected data. For example, to assign the name anonymous to the first 3 elements of the third column: In [26]: titanic.iloc[0:3, 3] = "anonymous" In [27]: titanic.head() Out[27]: ... These methods have in general matching names with the equivalent built-in string methods for single elements, but are applied element-wise (remember element-wise calculations?) on each of the values of the
0 credits | 3605 pages | 14.68 MB | 1 year ago
195 results in total













