Galvanize recently attended the Dato Data Science Summit in San Francisco, a gathering of more than 1,000 data scientists and researchers from industry and academia to discuss and learn about the most recent advances in data science, applied machine learning, and predictive applications.

Here are eight Python tools that our instructors think data scientists will be using in the coming months and years:

SFrame and SGraph
One of the biggest announcements out of the Dato Data Science Summit was that SFrame and SGraph will be going open source, available to anyone under a BSD license. SFrame (short for Scalable Data Frame) is a disk-backed columnar data structure optimized for memory efficiency and performance, with a DataFrame-like interface. SGraph has a similar ethos but is designed for representing graphs efficiently. One of the biggest advantages of these two data structures is that they let a data scientist do “out of core” analytics on datasets that do not fit in memory.
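
To make “out of core” concrete, here is a minimal sketch that uses only the Python standard library. It is not SFrame’s API or implementation. It stores a single column on disk and computes its mean while holding only one chunk in memory at a time. The file name `price.col`, the helper names, and the chunk size are all hypothetical.

```python
import os
import tempfile
from array import array


def write_column(path, values):
    """Store one column of floats on disk as raw 8-byte doubles."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def column_mean(path, chunk_rows=1024):
    """Compute the mean of an on-disk column, one chunk in memory at a time."""
    total, count = 0.0, 0
    item_size = array("d").itemsize
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk_rows * item_size)
            if not raw:
                break
            chunk = array("d")
            chunk.frombytes(raw)
            total += sum(chunk)
            count += len(chunk)
    return total / count if count else float("nan")


# Hypothetical demo data: a single "price" column with 10,000 rows.
with tempfile.TemporaryDirectory() as tmp:
    price_path = os.path.join(tmp, "price.col")
    write_column(price_path, (float(i) for i in range(10_000)))
    mean_price = column_mean(price_path, chunk_rows=1000)

print(mean_price)
```

Because each column lives in its own file, an aggregate over one column never has to touch the others, which is the basic appeal of a columnar layout.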

This is a watershed moment for Dato and the Python data community, as the open sourcing of these two libraries signals Dato’s commitment to supporting an open source Python ecosystem around data. Because Dato has an enterprise version, a common misconception in the community has been that anyone using the free version will get locked in and end up having to pay. The move to open source makes it clear that this sort of bait-and-switch is not Dato’s goal. Now that these two libraries are open source, we’ll hopefully see other developers adopt them in their own libraries (I’m looking at you, Pandas) to break away from memory limitations.

Bokeh
Bokeh is a Python interactive visualization library that lets you display elaborate, interactive graphics in your web browser, with or without a server. It can handle very large or even streaming datasets (such as a live spectrogram feed), and it is fast, embeddable, and supports novel interactions such as hover callbacks. It’s useful for anyone who wants to quickly and easily create interactive plots, dashboards, and data applications.

Where Bokeh really shines is in visualizing large datasets with many points. It’s in working with these datasets that you appreciate Bokeh’s focus on performance. It also enables interactive plots and graphics purely with Python. Currently, most interactive visualizations require JavaScript; Bokeh is a way to do it all in Python.

Dask
Dask is an out-of-core scheduler for Python. It helps you do block-based parallelism on large computations by dividing your data up into chunks and scheduling the computation over however many cores you have. Dask is written in pure Python and leverages the Python ecosystem, primarily targeting parallel computations that run on a single machine.
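
The idea of block-based parallelism can be sketched with the standard library alone. This is an illustration of the chunk-then-combine pattern, not Dask’s scheduler. The helper names, chunk size, and worker count are assumptions. A thread pool is used here for simplicity; for CPU-bound work in CPython a process pool would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Divide a list into consecutive blocks of at most chunk_size items."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def sum_of_squares(chunk):
    """The per-block computation; each block is processed independently."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, chunk_size=1000, workers=4):
    """Schedule one task per block, then combine the partial results."""
    chunks = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


result = parallel_sum_of_squares(list(range(10_000)))
print(result)
```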

There are two main ways to interact with Dask. Dask users will primarily use Dask collections, which are similar to popular libraries such as NumPy and Pandas, but generate graphs internally. Dask developers, on the other hand, will primarily be making graphs directly. Dask graphs encode algorithms using Python dicts, tuples, and functions, and can be used in isolation from the Dask collections.
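
As a rough illustration of what “dicts, tuples, and functions” means, the sketch below builds a small graph in that style and evaluates it with a toy recursive function. The `get` function here is a hypothetical stand-in written for this example, not Dask’s scheduler; a real scheduler would also handle caching, ordering, and parallel execution.

```python
from operator import add, mul


def get(graph, key):
    """Evaluate `key` in a dict-based task graph by recursing into its dependencies."""
    task = graph[key]
    # A task is a tuple whose first element is a callable; anything else is plain data.
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        resolved = [
            get(graph, a) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        return func(*resolved)
    return task


# Keys name intermediate results; tuples describe how to compute them.
graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),   # z = x + y
    "w": (mul, "z", 10),    # w = z * 10
}

answer = get(graph, "w")
print(answer)
```

Because the graph is just a dictionary, it can be built, inspected, or rewritten with ordinary Python before anything is executed.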

The Python ecosystem currently has a lot of libraries, many of them coming out of Continuum, that may seem to do the same thing. But Blaze, Dask, and Numba are not competing libraries; they’re meant to work together at different levels of data processing. By analogy, you can think of Blaze as being similar to a query optimizer in a relational database management system (RDBMS), whereas Dask can be thought of as the execution engine. In this context, Blaze optimizes the symbolic expressions of a query or command, whereas Dask can be used to optimize its execution on your hardware.

Ibis
If you’re a data scientist, chances are you use Python on a daily basis. But for everything it’s great at, Python does have its limitations. One of the biggest is that it doesn’t scale very well: it’s great for small data sets, but larger data requires sampling or aggregation, and using distributed tools can compromise your results in various ways.

A new project from Cloudera Labs, Ibis is a data analysis framework that aims to give data scientists and engineers the same Python experience they are used to, at any number of nodes and any data size. It aims to mirror the single-node Python experience without compromising functionality or usability, delivering the same interactive experience and full-fidelity analysis at big data scale.

Ibis allows for a 100% Python end-to-end user workflow and integrates with the existing Python data ecosystem (Pandas, Scikit-learn, NumPy, etc.). A preview of Ibis is available for installation now, and the project plans to add more features in the future, such as integration with advanced analytics, machine learning, and other performance computing tools.

Splash
A common problem when developing web-scraping bots is that many sites rely heavily on JavaScript. Web-scraping tools have difficulty executing JavaScript, so you often end up with only the raw HTML rather than the page as it looks after its scripts have run. Splash, built by Scrapinghub (the company behind Scrapy), is a JavaScript rendering service implemented in Python using Twisted and Qt. It’s a lightweight web browser with an HTTP API that can process multiple pages in parallel, execute custom JavaScript, and turn off images or use Adblock rules to render faster.
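
Since Splash is driven over HTTP, a request to it is just a URL. The sketch below builds a request for Splash’s `render.html` endpoint with its `url`, `wait`, and `images` arguments, and makes no network call. The base address `http://localhost:8050` assumes a Splash instance running locally on its default port, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, splash_base="http://localhost:8050",
                      wait=0.5, images=False):
    """Build a render.html request URL for a Splash instance.

    wait   -- seconds to let the page's JavaScript run before returning HTML
    images -- whether Splash should download images (off renders faster)
    """
    params = {"url": page_url, "wait": wait, "images": int(images)}
    return f"{splash_base}/render.html?{urlencode(params)}"


request_url = splash_render_url("http://example.com")
print(request_url)
```

Fetching that URL with any HTTP client should return the page’s HTML after its scripts have run, which a scraper can then parse as usual.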

Petuum
Petuum is a distributed machine learning framework that aims to provide a generic algorithmic and systems interface to large-scale machine learning. It provides distributed programming tools that can assist with the challenges of running machine learning at scale. Petuum is designed specifically for machine learning, which means that it takes advantage of data correlation, staleness, and other statistical properties to maximize performance.

Petuum has a number of core features. Bösen is a bounded-asynchronous distributed key-value store for data-parallel machine learning programming. It uses the Stale Synchronous Parallel consistency model, which allows asynchronous-like performance without sacrificing algorithm correctness. Strads is a dynamic scheduler for model-parallel machine learning programming. It performs fine-grained scheduling of machine learning update operations, prioritizing computation on the parts of the program that need it most while avoiding unsafe parallel operations that could hurt performance.
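
The core rule behind Stale Synchronous Parallel can be shown with a toy model: each worker keeps an iteration clock, and a worker may run ahead of the slowest worker by at most a fixed staleness bound. The class below is a hypothetical illustration of that rule only. It is not Bösen’s API and leaves out the parameter store itself.

```python
class StaleSynchronousClock:
    """Toy model of the SSP rule: no worker runs more than `staleness`
    iterations ahead of the slowest worker."""

    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers
        self.staleness = staleness

    def can_start_iteration(self, worker):
        """True if this worker is within the staleness bound of the slowest one."""
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def finish_iteration(self, worker):
        """Advance the worker's clock after it completes one iteration."""
        self.clocks[worker] += 1


clock = StaleSynchronousClock(num_workers=2, staleness=1)
fast, slow = 0, 1

# The fast worker runs ahead until the bound stops it.
steps_before_blocking = 0
while clock.can_start_iteration(fast):
    clock.finish_iteration(fast)
    steps_before_blocking += 1

# Once the slow worker makes progress, the fast worker may continue.
clock.finish_iteration(slow)
unblocked = clock.can_start_iteration(fast)

print(steps_before_blocking, unblocked)
```

With a staleness of 0 the same rule reduces to fully synchronous execution, where every worker waits for all the others after each iteration.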

Flink
Apache Flink is an open source platform for scalable batch and stream data processing. The core of Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. It’s very similar to Apache Spark, given that one of its primary goals is to serve as a replacement for MapReduce, the aging heart of Hadoop.

The APIs of Spark and Flink are rather similar, but they have a few major differences in how they process data. When Spark processes a stream, it actually uses micro-batching: a fast batch operation that works on the small slice of incoming data that arrives during each unit of time. This is an approximation of stream processing, and normally it’s fine, but it can cause problems and slowdowns in low-latency situations. Flink, on the other hand, is primarily a stream processing framework that can also do batch processing. In other words, instead of being able to do the easy job (batch processing) and an approximation of the hard one (stream processing), Flink was made to do the more difficult job, and can also handle the easier task.
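
The difference between the two models can be sketched in plain Python. This is a conceptual illustration, not Spark or Flink code. Both functions keep a running total over the same events, but the first emits a result as each event arrives, while the second only emits at the end of each batch window. The event data and the one-second interval are made up for the example.

```python
def process_per_event(events):
    """Record-at-a-time: emit an updated running total as each event arrives."""
    total = 0
    for timestamp, value in events:
        total += value
        yield timestamp, total


def process_micro_batches(events, batch_interval):
    """Micro-batching: buffer events per time window, emit once per window."""
    total = 0
    current_window = None
    for timestamp, value in events:
        window = int(timestamp // batch_interval)
        if current_window is not None and window != current_window:
            # The previous window has closed; its result becomes visible now.
            yield (current_window + 1) * batch_interval, total
        current_window = window
        total += value
    if current_window is not None:
        yield (current_window + 1) * batch_interval, total


# Hypothetical events as (arrival time in seconds, value).
events = [(0.1, 1), (0.4, 2), (1.2, 3), (2.7, 4)]

streamed = list(process_per_event(events))
batched = list(process_micro_batches(events, batch_interval=1.0))

print(streamed)
print(batched)
```

In the micro-batch output, the event that arrived at 0.1 seconds is not reflected in any result until the window closes at 1.0 seconds. That delay is the latency cost the paragraph above describes.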

Pyxley
Web-based dashboards are one of the best and most straightforward ways to share data science insights. But while Shiny provides a framework for data scientists working in R to build interactive web applications without having to write JavaScript, HTML, or CSS, there hasn’t been an equivalent for Python. Pyxley fills that gap: it’s a Python package that simplifies the development of web applications and provides an easy way to incorporate custom JavaScript, enabled through Flask, PyReact, and Pandas.

Can’t get enough data science? Register for our upcoming workshop “Spark After Dark: Analytics and Machine Learning” at Galvanize’s San Francisco campus. Use promo code “MLforever” to get 25% off your ticket.

Source: Galvanize blog, www.galvanize.com/blog/2015/07/31/ (“8 tools that show what’s on the horizon for the Python data ecosystem”)