Galvanize recently attended the Dato Data Science Summit in San Francisco, a gathering of more than 1,000 data scientists and researchers from industry and academia to discuss and learn about the most recent advances in data science, applied machine learning, and predictive applications.

Here are eight Python tools that our instructors think data scientists will be using in the coming months and years:

SFrame and SGraph
One of the biggest announcements out of the Dato Data Science Summit was that SFrame and SGraph will be going open source, available to anyone under a BSD license. SFrame (short for Scalable Data Frame) is a disk-backed columnar data structure with a DataFrame-like interface, optimized for memory efficiency and performance. SGraph has a similar ethos, but for representing graphs efficiently. One of the biggest advantages of these two data structures is that they let a data scientist do “out of core” analytics on datasets that do not fit in memory.
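
To make “out of core” concrete, here is a minimal sketch in plain Python. It does not use SFrame's API; it only illustrates the idea of streaming a file from disk in fixed-size chunks so that memory use stays bounded however large the file is. The function names, the `value` column, and the chunk size are all hypothetical.

```python
import csv
import itertools
import os
import tempfile


def chunked_mean(path, column, chunk_size=1000):
    """Mean of one CSV column, holding at most chunk_size rows in memory."""
    total = 0.0
    count = 0
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            # Pull the next block of rows from disk; only this block is in memory.
            chunk = list(itertools.islice(reader, chunk_size))
            if not chunk:
                break
            total += sum(float(row[column]) for row in chunk)
            count += len(chunk)
    return total / count if count else float("nan")


def write_demo_csv(n_rows):
    """Write a small demo file (made-up data: the integers 0..n_rows-1)."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["value"])
        for i in range(n_rows):
            writer.writerow([i])
    return path


demo_path = write_demo_csv(10_000)
demo_mean = chunked_mean(demo_path, "value", chunk_size=512)
os.remove(demo_path)
print(demo_mean)
```

A disk-backed structure such as SFrame hides this kind of bookkeeping behind a DataFrame-like interface; the sketch only shows why the data never has to fit in memory at once.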

This is a watershed moment for Dato and the Python data community, as the open sourcing of these two libraries signals Dato’s commitment to supporting an open source Python ecosystem around data. Because Dato has an enterprise version, a common misconception in the community has been that users of the free version will get locked in and end up having to pay. The move to open source makes it clear that this sort of bait-and-switch is not Dato’s goal. Now that these two libraries are open source, we’ll hopefully see other developers start adopting them in their own libraries (I’m looking at you, Pandas) to break away from memory limitations.

Bokeh
Bokeh is a Python interactive visualization library that lets you display elaborate, interactive graphics in your web browser, with or without a server. It’s fast and embeddable, it can handle very large or even streaming datasets (such as a live spectrogram feed), and it supports novel visualization features such as hover callbacks. It’s useful for anyone who wants to quickly and easily create interactive plots, dashboards, and data applications.

Where Bokeh really shines is in visualizing large datasets with many points; working with these datasets is when you appreciate Bokeh’s focus on performance. It also lets you build interactive plots and graphics purely in Python. Currently, most interactive graphics require JavaScript; Bokeh is a way to do it all in Python.

Dask
Dask is an out-of-core scheduler for Python. It helps you do block-based parallelism on large computations by dividing your data up into chunks and scheduling the computation over however many cores you have. Dask is written in pure Python and leverages the Python ecosystem, primarily targeting parallel computations that run on a single machine.
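
The chunk-and-schedule idea can be sketched with the standard library alone. This is not Dask's API or its scheduler; `split_into_chunks` and `chunked_sum_of_squares` are hypothetical names, and a thread pool stands in for Dask's scheduling so the example stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Yield consecutive slices of data, each at most chunk_size long."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def sum_of_squares(chunk):
    """Work done independently on each block."""
    return sum(x * x for x in chunk)


def chunked_sum_of_squares(data, chunk_size=1000, workers=4):
    """Map the per-block function over the chunks, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, split_into_chunks(data, chunk_size)))
    return sum(partials)


numbers = list(range(10_000))
result = chunked_sum_of_squares(numbers, chunk_size=1024)
print(result == sum(x * x for x in numbers))
```

In standard CPython, threads typically will not speed up pure-Python arithmetic like this; the point of the sketch is the split, map, and combine structure that a block-based scheduler automates.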

There are two main ways to interact with Dask. Dask users will primarily use Dask collections, which are similar to popular libraries such as NumPy and Pandas, but generate graphs internally. Dask developers, on the other hand, will primarily be making graphs directly. Dask graphs encode algorithms using Python dicts, tuples, and functions, and can be used in isolation from the Dask collections.
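
As a rough illustration of that encoding, the dictionary below maps names either to plain values or to task tuples of the form `(function, arg1, arg2, ...)`. The small `evaluate` function is a hypothetical stand-in written for this sketch, not Dask's scheduler; it assumes that any string argument matching a key in the graph refers to that key's result.

```python
from operator import add, mul

# A graph in the dict/tuple/function style: keys name results,
# tuples describe a call whose arguments may be other keys.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "product": (mul, "sum", 10),
}


def evaluate(graph, key, cache=None):
    """Recursively compute graph[key], reusing results that are already known."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    node = graph[key]
    if isinstance(node, tuple) and node and callable(node[0]):
        func, *args = node
        # Arguments that name another key are computed first; others pass through.
        resolved = [
            evaluate(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        value = func(*resolved)
    else:
        value = node
    cache[key] = value
    return value


print(evaluate(graph, "product"))  # 30
```

Because the graph is ordinary Python data, it can be built, inspected, or rewritten before anything runs, which is what lets a scheduler decide how to execute it.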

There are currently a lot of libraries in the Python ecosystem, many of them coming out of Continuum, that may seem to do the same thing. But Blaze, Dask, and Numba are not competing libraries; they’re meant to work together at different levels of data processing. By analogy, you can think of Blaze as being similar to a query optimizer in a relational database management system (RDBMS), whereas Dask can be thought of as the execution engine. In this context, Blaze optimizes the symbolic expressions of a query or command, whereas Dask can be used to optimize its execution on your hardware.

Ibis
If you’re a data scientist, chances are you use Python on a daily basis. But for everything it’s great at, Python does have its limitations. One of its biggest problems is that Python doesn’t scale very well. It’s great for small data sets, but requires sampling or aggregations for larger data, and using distributed tools can compromise your outcome in various ways.

A new project from Cloudera Labs, Ibis is a data analysis framework that aims to provide the same Python experience data scientists and engineers are used to, at any node count and data size. It mirrors the single-node Python experience without compromising functionality or usability, delivering the same interactive experience and full-fidelity analysis at big data scale.

Ibis allows for a 100% Python end-to-end user workflow and integrates with the existing Python data ecosystem (Pandas, Scikit-learn, NumPy, etc.). A preview of Ibis is available for installation now, and the project will expand to include more features in the future, such as integration with advanced analytics, machine learning, and other performance computing tools.

Splash
A common problem when developing web-scraping bots is that many sites rely heavily on JavaScript. Web-scraping tools have difficulty executing JavaScript, so you often end up with only the raw HTML and not the result of running the page’s scripts. Splash, built by Scrapy creator ScrapingHub, is a JavaScript rendering service implemented in Python using Twisted and Qt. It’s a lightweight web browser with an HTTP API that is capable of processing multiple pages in parallel, executing custom JavaScript, and turning off images or using Adblock to render faster.
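
A request to Splash's HTTP API might look like the sketch below. It assumes a Splash instance is already running locally on port 8050 (a hypothetical deployment) and uses the `render.html` endpoint with its `url`, `wait`, and `images` arguments. The helper names are illustrative, and no network call is made unless you call `fetch_rendered_html` yourself.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash is running locally and listening on port 8050.
SPLASH_BASE = "http://localhost:8050"


def build_render_url(page_url, wait=0.5, load_images=False):
    """Build a render.html request that returns the page after JavaScript runs."""
    params = {
        "url": page_url,
        "wait": wait,                       # seconds to let scripts run
        "images": 1 if load_images else 0,  # skip images to render faster
    }
    return f"{SPLASH_BASE}/render.html?{urlencode(params)}"


def fetch_rendered_html(page_url, **kwargs):
    """Fetch the rendered HTML (requires a reachable Splash instance)."""
    with urlopen(build_render_url(page_url, **kwargs), timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


request_url = build_render_url("http://example.com", wait=1)
print(request_url)
```

A scraper would then parse the string returned by `fetch_rendered_html` exactly as it would parse raw HTML, except that script-generated content is already in place.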

Petuum
Petuum is a distributed machine learning framework that aims to provide a generic algorithmic and systems interface to large-scale machine learning. It provides distributed programming tools that can assist with the challenges of running machine learning at scale. Petuum is designed specifically for machine learning, which means that it takes advantage of data correlation, staleness, and other statistical properties to maximize performance.

Petuum has a number of core features. Bösen is a bounded-asynchronous distributed key-value store for data-parallel machine learning programming. It uses the Stale Synchronous Parallel consistency model, which allows asynchronous-like performance without sacrificing algorithm correctness. Another feature is Strads, a dynamic scheduler for model-parallel machine learning programming. It performs fine-grained scheduling of machine learning update operations, prioritizing computation on the parts of the program that need it most while avoiding unsafe parallel operations that could hurt performance.
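
The bounded-staleness idea behind Stale Synchronous Parallel can be sketched in a few lines. This is a toy model of the general idea, not Bösen's API, and the precise rule Petuum uses may differ: here each worker counts its completed iterations, and a worker may start another iteration only while it is at most `staleness` (a hypothetical parameter) iterations ahead of the slowest worker.

```python
def can_start_next(clocks, worker, staleness):
    """True if the worker is at most `staleness` iterations ahead of the slowest."""
    return clocks[worker] - min(clocks.values()) <= staleness


def simulate(speeds, staleness, ticks):
    """Toy simulation: in each tick, worker w attempts speeds[w] iterations."""
    clocks = {w: 0 for w in speeds}
    for _ in range(ticks):
        for worker, attempts in speeds.items():
            for _ in range(attempts):
                if can_start_next(clocks, worker, staleness):
                    clocks[worker] += 1  # the iteration completes
                # otherwise the worker waits for stragglers to catch up
    return clocks


clocks = simulate({"fast": 3, "slow": 1}, staleness=2, ticks=5)
print(clocks)
```

With `staleness=0` the fast worker can never run ahead, which behaves like fully synchronous execution; a larger bound lets fast workers keep computing while still limiting how out of date their view of the shared state can become.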

Flink
Apache Flink is an open source platform for scalable batch and stream data processing. The core of Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. It’s very similar to Apache Spark, given that one of its primary goals is to serve as a replacement for MapReduce, the aging heart of Hadoop.

The APIs of Spark and Flink are rather similar, but they have a few major differences in how they process data. When Spark processes a stream, it actually uses micro-batching, a fast-batch operation that works on a small part of incoming data during a unit of time. This is an approximation of stream-processing, and normally it’s fine, but it can cause problems and slowdowns in low-latency situations. Flink, on the other hand, is primarily a stream processing framework that can also do batch processing. In other words, instead of being able to do the easy job (batch processing) and an approximation of the hard one (stream processing), Flink was made to do the more difficult job, and can also handle the easier task.
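
The difference between the two models can be sketched with plain Python generators. Neither function uses Spark's or Flink's API; `per_event` and `micro_batch` are hypothetical names, and the events are made-up `(timestamp, value)` pairs. Each output pair records when a result becomes available, which is where micro-batching adds latency.

```python
def per_event(events):
    """Streaming style: emit a result as soon as each event arrives."""
    for timestamp, value in events:
        yield timestamp, value * 2  # result is ready at the event's own time


def micro_batch(events, interval):
    """Micro-batch style: collect events per time window, emit at window end."""
    batch, window_end = [], None
    for timestamp, value in events:
        if window_end is None:
            window_end = (timestamp // interval + 1) * interval
        while timestamp >= window_end:
            # Window closed: everything collected so far is emitted together.
            for v in batch:
                yield window_end, v * 2
            batch, window_end = [], window_end + interval
        batch.append(value)
    for v in batch:  # flush the final, partially filled window
        yield window_end, v * 2


events = [(0.1, 1), (0.4, 2), (1.2, 3), (2.7, 4)]
streamed = list(per_event(events))
batched = list(micro_batch(events, interval=1.0))
print(streamed)
print(batched)
```

Both versions compute the same values, but in the micro-batch version an event that arrives early in a window waits until the window closes, so the delay per result can approach the full batch interval.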

Pyxley
Web-based dashboards are one of the best and most straightforward ways to share data science insights. But while Shiny provides a framework for data scientists working in R to build interactive web applications without having to write JavaScript, HTML, or CSS, there hasn’t been an equivalent for Python. Pyxley fills that gap: it’s a Python package that simplifies the development of web applications and provides an easy way to incorporate custom JavaScript, enabled through Flask, PyReact, and Pandas.

Can’t get enough data science? Register for our upcoming workshop “Spark After Dark: Analytics and Machine Learning” at Galvanize’s San Francisco campus. Use promo code “MLforever” to get 25% off your ticket.

